
I'm trying to scrape data from apartment listings linked from this Chinese website.

Thing is, every link I follow seems to go through a redirecting page to prevent me from scraping it. When I click on a link from Chrome, for example this link, everything's fine: the redirecting page loads super fast and I reach the apartment description page. But when my spider goes there, all it gets back is the redirect page (its HTML title is 跳转, which means "redirecting" according to Google Translate). I want my spider to behave as a normal user would, that is, to wait for the redirecting page to resolve and go on to reach its destination.

I'm quite a beginner at web scraping, but I followed the Scrapy tutorials and these topics:

However, these teach me how to stop redirects, while it seems to me my spider won't reach its destination unless it follows that redirect. I also looked up http://scraping.pro/7-ways-protect-website-scraping-bypass-protection and couldn't find anything that matches the defense mechanism used by this Chinese website.
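
For reference, this is the kind of per-request control those topics describe. It only stops Scrapy from silently following the 302, but at least it lets me look at the raw redirect response. The meta keys are Scrapy's standard ones; the spider and callback names here are placeholders of mine:

import scrapy

class RedirectPeekSpider(scrapy.Spider):
    # minimal sketch, only to show the meta flags; name and start_urls are placeholders
    name = 'redirect_peek'
    start_urls = ['https://yz.esf.fang.com']

    def parse(self, response):
        url = response.urljoin('/chushou/3_369807146.htm')
        # dont_redirect stops RedirectMiddleware from following the 302;
        # handle_httpstatus_list lets the 302 reach the callback instead of being filtered out
        yield scrapy.Request(
            url,
            meta={'dont_redirect': True, 'handle_httpstatus_list': [302]},
            callback=self.inspect_redirect,
        )

    def inspect_redirect(self, response):
        # the Location header shows where the site wants to send the crawler
        print(response.status, response.headers.get('Location'))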

Here is what my spider looks like:

import scrapy
from scrapy.spiders import CrawlSpider

class YangzhouSpider(CrawlSpider):
    name = 'fangtry'
    allowed_domains = ['fang.com']
    start_urls = ['https://yz.esf.fang.com']

    def parse(self, response):
        print("This is HTML for main page : \n", response.text)

        # this should match every apartment
        all_apartments = response.xpath("//dl[@dataflag='bg']")

        # this gets the link for the first apartment:
        first_apartment_link = all_apartments[0].xpath(".//h4[@class='clearfix']/a/@href").get()
        #  ------------
        #       ╰---> equals '/chushou/3_369807146.htm'

        follow_url = response.urljoin(first_apartment_link)
        # ------
        #   ╰---> equals 'https://yz.esf.fang.com/chushou/3_369807146.htm'

        yield scrapy.Request(follow_url, callback=self.parse_detail)

    def parse_detail(self, response):
        crappy_html = response.text
        print("this HTML is bad : \n", crappy_html)
        exit(420)  # stop the crawl here while debugging

The HTML for the main page is fine, but crappy_html has no information about the page I'm interested in. The console gives me:

2019-07-22 16:34:06 [fangtry] INFO: Spider opened: fangtry
2019-07-22 16:34:06 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-07-22 16:34:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://yz.esf.fang.com> (referer: None)
This is HTML for main page : 

<!DOCTYPE html>
<html>
...
</html>

and for the crappy part:

2019-07-22 16:34:07 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://search.fang.com/captcha-verify/redirect?h=https://yz.esf.fang.com/chushou/3_373806230.htm> from <GET https://yz.esf.fang.com/chushou/3_373806230.htm>
2019-07-22 16:34:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://search.fang.com/captcha-verify/redirect?h=https://yz.esf.fang.com/chushou/3_373806230.htm> (referer: None)
this HTML is bad : 
 <html xmlns="http://www.w3.org/1999/xhtml" lang="UTF-8"><head>
<meta name="mobile-agent" content="format=html5;url=https://m.fang.com/news/bj.html">
<meta http-equiv="content-Type" content="text/html; charset=UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<title>跳转...</title>
...
</html>
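
To make the "behave as a normal user" idea concrete, what I have in mind is something like the following browser-driven fetch, which waits for the 跳转 page to resolve. This is only a Selenium sketch to illustrate the point; the title-based wait condition is a guess on my part, and I have not wired anything like this into Scrapy:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get('https://yz.esf.fang.com/chushou/3_369807146.htm')

# wait until the page title is no longer the 跳转 ("redirecting") placeholder
WebDriverWait(driver, 15).until(lambda d: '跳转' not in d.title)

html = driver.page_source  # should now be the apartment description page
driver.quit()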

So there you have it. Apologies if it turns out this question has already been answered elsewhere; I swear I tried to find a matching topic but couldn't find any. Any suggestions are greatly appreciated.

Simon Q.
    You can see this answer: https://stackoverflow.com/questions/22795416/how-to-handle-302-redirect-in-scrapy. Redirect information is often in the response headers, so check the headers and see if there is a redirect target. Usually a redirect will be a 302 and is followed automatically within Scrapy. – ThePyGuy Jul 25 '19 at 06:01

1 Answer


Sadly, I found no easy way to overcome this issue. I recommend using an RPA tool if you need complex scraping features rather than building everything from scratch. ScrapeStorm worked very well for me in this case.
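
If you would rather stay inside Scrapy than switch tools, the usual suggestion for JavaScript interstitials like this one is to render pages through Splash via scrapy-splash. Treat the following only as a setup sketch based on the scrapy-splash README, not something verified against fang.com:

# settings.py -- middleware priorities as given in the scrapy-splash README
SPLASH_URL = 'http://localhost:8050'  # a local Splash instance (docker run -p 8050:8050 scrapinghub/splash)
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# spider sketch: request the apartment page through Splash and give the
# interstitial a couple of seconds to run its JavaScript
import scrapy
from scrapy_splash import SplashRequest

class FangSplashSpider(scrapy.Spider):
    name = 'fang_splash'  # placeholder name
    start_urls = ['https://yz.esf.fang.com']

    def parse(self, response):
        follow_url = response.urljoin('/chushou/3_369807146.htm')
        yield SplashRequest(follow_url, self.parse_detail, args={'wait': 2})

    def parse_detail(self, response):
        print(response.text)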

Simon Q.