I am trying to crawl booking.com from a VM, and I don't get the same response as on my local machine. The command is the following:

scrapy shell --set="ROBOTSTXT_OBEY=False" -s USER_AGENT="Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0" "https://www.booking.com/hotel/fr/le-transat-bleu.fr.html?aid=304142;label=gen173nr-1FCAEoggJCAlhYSDNiBW5vcmVmaE2IAQGYAQ3CAQp3aW5kb3dzIDEwyAEM2AEB6AEB-AELkgIBeagCAw;sid=746d95cb38d6de7fbb5a878954481e7b;all_sr_blocks=33843609_122840412_1_2_0;checkin=2019-03-17;checkout=2019-03-18;dest_id=-1424668;dest_type=city;dist=0;group_adults=1;group_children=0;hapos=1;highlighted_blocks=33843609_122840412_1_2_0;hpos=1;req_adults=1;req_children=0;room1=A%2C;sb_price_type=total;sr_order=popularity;srepoch=1550502677;srpvid=26936aca347f0334;type=total;ucfs=1&#hotelTmpl"

When I run the command from my local machine, I get a response with the same URL as the one in the command, while from the VM I get the generic response:

https://www.booking.com/hotel/fr/le-transat-bleu.fr.html

I should mention that before adding the USER_AGENT part, I was getting the same generic response even on my local machine.

Also, if I use Links, a command-line browser, from the VM, I get the correct response. Hence the problem does not seem to come from the public IP of the VM.

I suspect that booking.com uses some other piece of information, on top of the USER_AGENT and the robots.txt file, to prevent the crawling of certain pages, but I don't know which one.

Local Request Headers

{b'Accept': b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*; q=0.8', b'Accept-Language': b'en', b'User-Agent': b'Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0', b'Accept-Encoding': b'gzip,deflate'}

VM Request Headers

{b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*; q=0.8'], b'Accept-Language': [b'en'], b'User-Agent': [b'Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0'], b'Accept-Encoding': [b'gzip,deflate'], b'Cookie': [b'bkng=11UmFuZG9tSVYkc2RlIyh9Yaa29%2F3xUOLbXpFeYC4TUhBTLg%2BWRWQhTWxLpR01uuU40DSTIBsY%2F5OusQaibxVABBhdPCiYlEsnGLdmcDyD%2BtWFGVlewF8Fo59TLNV6vs0R1Ypha9MOkYUl6wASmexLrJie%2F3imTygdbEEsnB0sv0m%2B%2FJ1C6Cm42FEFBT222yQ7']}

VM Request without cookies

scrapy shell --set="COOKIES_ENABLED=False" --set="ROBOTSTXT_OBEY=False" -s USER_AGENT="Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0" "https://www.booking.com/hotel/fr/le-transat-bleu.fr.html?aid=304142;label=gen173nr-1FCAEoggJCAlhYSDNiBW5vcmVmaE2IAQGYAQ3CAQp3aW5kb3dzIDEwyAEM2AEB6AEB-AELkgIBeagCAw;sid=746d95cb38d6de7fbb5a878954481e7b;all_sr_blocks=33843609_122840412_1_2_0;checkin=2019-03-17;checkout=2019-03-18;dest_id=-1424668;dest_type=city;dist=0;group_adults=1;group_children=0;hapos=1;highlighted_blocks=33843609_122840412_1_2_0;hpos=1;req_adults=1;req_children=0;room1=A%2C;sb_price_type=total;sr_order=popularity;srepoch=1550502677;srpvid=26936aca347f0334;type=total;ucfs=1&#hotelTmpl"

VM Request Headers without cookies

{b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en'], b'User-Agent': [b'Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0'], b'Accept-Encoding': [b'gzip,deflate']}
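
The header dumps above are hard to compare by eye, partly because the VM wraps every value in a list. Below is a minimal sketch of a comparison helper, assuming the dumps are copied verbatim from `response.request.headers`; the names `normalize` and `diff_headers` are made up for this illustration and are not part of Scrapy.

```python
def normalize(headers):
    """Turn a header dump into a plain {str: str} dict with lowercase names."""
    result = {}
    for name, value in headers.items():
        # The VM dump wraps each value in a list; the local dump does not.
        values = value if isinstance(value, list) else [value]
        result[name.decode("latin-1").lower()] = ", ".join(
            v.decode("latin-1") for v in values
        )
    return result


def diff_headers(a, b):
    """Return {name: (value_in_a, value_in_b)} for headers that differ."""
    a, b = normalize(a), normalize(b)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Dumps copied from "Local Request Headers" and
# "VM Request Headers without cookies" above.
local = {
    b'Accept': b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*; q=0.8',
    b'Accept-Language': b'en',
    b'User-Agent': b'Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0',
    b'Accept-Encoding': b'gzip,deflate',
}
vm_no_cookies = {
    b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'],
    b'Accept-Language': [b'en'],
    b'User-Agent': [b'Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0'],
    b'Accept-Encoding': [b'gzip,deflate'],
}

for name, (local_value, vm_value) in diff_headers(local, vm_no_cookies).items():
    print(name)
    print("  local:", local_value)
    print("  vm:   ", vm_value)
```

With these two dumps, the only header reported is `Accept`, and the two values differ only by the space before `q=0.8`.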
Yohan Obadia
  • Are both environments using the same version of Python and libraries? What operating system is on your host machine and what is on the VM? – malberts Feb 19 '19 at 10:27
  • My local machine runs Windows 10 and the VM is an Amazon EC2 instance. Both have Python 3.7. I don't see how this could have any impact though. It is what booking chooses to answer that bothers me, since I send a User-Agent saying that I am an Android mobile device. I suspect that some other piece of information is passed that betrays scrapy. – Yohan Obadia Feb 19 '19 at 12:29
  • I'm asking in case there is some underlying difference that causes additional/different request headers. So I would suggest you print `response.request.headers` in both cases to check what they are sending differently. – malberts Feb 19 '19 at 13:05
  • I just added the request headers. I notice a cookie on the VM, I'll check without one. Thanks for pointing me in this direction. – Yohan Obadia Feb 19 '19 at 13:12
  • Just checked and updated the info I get without cookies. If you see anything else I'd like your input. – Yohan Obadia Feb 19 '19 at 13:19
  • Do you have the same code in the folder on both systems? Maybe check if you have differences in `settings.py`. That might account for why one is sending cookies. – malberts Feb 19 '19 at 13:51
  • There are no cookies sent anymore. The only 2 differences I spot now are the `q=0.9` vs `q=0.8` and the fact that the header values are stored in lists for the VM. What does the `q=0.8` mean, btw? – Yohan Obadia Feb 19 '19 at 13:54
  • Does it work correctly now without the cookies? Or are those other differences still an issue? As for `q`, have a look at https://stackoverflow.com/a/10496722/ – malberts Feb 19 '19 at 14:49
  • The problem remains even without cookies... – Yohan Obadia Feb 19 '19 at 15:14
  • I don't know why the headers would differ like that, so this isn't a proper solution, but try to reconstruct all the headers so they look exactly like the working set. If that still causes a problem then I'm out of ideas. – malberts Feb 19 '19 at 15:18
  • Major websites can respond differently depending on the sender's IP (residential IP (local), datacenter IP (Amazon EC2), and/or IPs from different countries), even if the requests are otherwise identical. – Georgiy Feb 19 '19 at 15:24
  • @Georgiy that is what I suspect; however, I tried the command-line browser Links from the EC2 instance and got the correct answer. If that were not the case I would suspect the IP, but for that reason I am left doubting... Anyway, thank you both for the time you're spending on this! – Yohan Obadia Feb 20 '19 at 14:25
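
The suggestion in the comments to rebuild the headers so they match the working set could look roughly like the `settings.py` sketch below. It assumes the local dump in the question is the "working set" and uses standard Scrapy settings. Nothing in the thread confirms that matching headers changes the response, so this only rules headers out as the cause.

```python
# settings.py (sketch): pin outgoing headers to the values seen on the local machine.
ROBOTSTXT_OBEY = False
COOKIES_ENABLED = False

# Scrapy sets the User-Agent header from this setting.
USER_AGENT = "Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0"

# Values copied from the "Local Request Headers" dump in the question.
# Accept-Encoding is left out because Scrapy's compression middleware adds it.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*; q=0.8",
    "Accept-Language": "en",
}
```

Printing `response.request.headers` again after this change shows whether the VM now sends the same set as the local machine.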

0 Answers