I'm trying to scrape data from apartment links on this Chinese website.
The problem is that every link I follow goes through a redirect page, apparently to prevent scraping. When I click a link in Chrome (this link, for example), everything is fine: the redirect page loads very quickly and I reach the apartment description page. But when my spider requests the same link, the only response I get is the redirect page itself (its HTML title is 跳转, which means "redirecting" according to Google Translate). I want my spider to behave as a normal user would, that is, to wait for the redirect page to resolve and then reach its destination.
I'm a beginner in web scraping, but I followed the Scrapy tutorials and these topics:
- how to handle 302 redirect in scrapy
- scrapy- how to stop Redirect (302)
- https://hub.packtpub.com/4-common-challenges-web-scraping-handle
However, these teach me how to stop redirecting, whereas it seems to me that my spider won't reach its destination unless it follows that redirect. I also looked at http://scraping.pro/7-ways-protect-website-scraping-bypass-protection and couldn't find anything that matches the defense mechanism used by this website.
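For reference, what those topics suggest boils down to passing two request meta keys so that Scrapy's redirect middleware leaves the 302 alone and hands it to the callback. A minimal sketch of that, where the helper name `no_redirect_meta` is made up for illustration and `dont_redirect` and `handle_httpstatus_list` are the Scrapy meta keys those answers rely on:

```python
def no_redirect_meta():
    """Build the request meta that tells Scrapy not to follow a 302.

    Hypothetical helper; only the two meta keys come from Scrapy itself.
    """
    return {
        "dont_redirect": True,            # redirect middleware skips this request
        "handle_httpstatus_list": [302],  # let the 302 response reach the callback
    }


# Intended use inside a spider (assumes scrapy is imported there):
#     yield scrapy.Request(follow_url, meta=no_redirect_meta(),
#                          callback=self.parse_detail)
```

With this, the callback would receive the 302 response itself rather than the page it points to, which is the opposite of what I am after.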
Here is what my spider looks like:
import scrapy
from scrapy.exceptions import CloseSpider


class YangzhouSpider(scrapy.Spider):
    # Plain Spider: CrawlSpider uses parse() internally, so parse() should not be overridden there.
    name = 'fangtry'
    allowed_domains = ['fang.com']
    start_urls = ['https://yz.esf.fang.com']

    def parse(self, response):
        print("This is HTML for main page : \n", response.text)
        # this should match every apartment
        all_apartments = response.xpath("//dl[@dataflag='bg']")
        # link for the first apartment, e.g. '/chushou/3_369807146.htm'
        first_apartment_link = all_apartments[0].xpath(".//h4[@class='clearfix']/a/@href").get()
        # absolute URL, e.g. 'https://yz.esf.fang.com/chushou/3_369807146.htm'
        follow_url = response.urljoin(first_apartment_link)
        yield scrapy.Request(follow_url, callback=self.parse_detail)

    def parse_detail(self, response):
        crappy_html = response.text
        print("this HTML is bad : \n", crappy_html)
        # stop the crawl after the first detail page (debugging only)
        raise CloseSpider("stopping after first detail page")
The HTML for the main page is fine, but crappy_html contains no information about the page I'm interested in. The console gives me:
2019-07-22 16:34:06 [fangtry] INFO: Spider opened: fangtry
2019-07-22 16:34:06 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-07-22 16:34:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://yz.esf.fang.com> (referer: None)
This is HTML for main page :
<!DOCTYPE html>
<html>
...
</html>
and for the bad part:
2019-07-22 16:34:07 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://search.fang.com/captcha-verify/redirect?h=https://yz.esf.fang.com/chushou/3_373806230.htm> from <GET https://yz.esf.fang.com/chushou/3_373806230.htm>
2019-07-22 16:34:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://search.fang.com/captcha-verify/redirect?h=https://yz.esf.fang.com/chushou/3_373806230.htm> (referer: None)
this HTML is bad :
<html xmlns="http://www.w3.org/1999/xhtml" lang="UTF-8"><head>
<meta name="mobile-agent" content="format=html5;url=https://m.fang.com/news/bj.html">
<meta http-equiv="content-Type" content="text/html; charset=UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<title>跳转...</title>
...
</html>
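One thing I did notice in the log above: the redirect URL carries the original apartment address in its `h` query parameter. A minimal sketch of how I could detect the redirect page and recover the intended target, assuming the `/captcha-verify/redirect` path and the `h` parameter stay as they appear in my log (the function names are my own, not part of Scrapy):

```python
from urllib.parse import urlparse, parse_qs

# Path of the intermediate page, copied from the log output (an assumption
# that it is the same for every apartment link).
CAPTCHA_PATH = "/captcha-verify/redirect"


def is_captcha_redirect(url):
    """Return True if the URL points at the intermediate redirect page."""
    return urlparse(url).path == CAPTCHA_PATH


def original_target(url):
    """Return the URL stored in the 'h' query parameter, or None if absent."""
    values = parse_qs(urlparse(url).query).get("h")
    return values[0] if values else None
```

In parse_detail I could call `is_captcha_redirect(response.url)` to tell the two cases apart. I have not verified that requesting the recovered URL again gets past the check.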
So there you have it. Apologies if this question has already been answered elsewhere; I tried to find a matching topic but couldn't. Any suggestions are greatly appreciated.