Currently I can get an endless stream of crawled links from softpedia.com, including the desired installer links, such as http://hotdownloads.com/trialware/download/Download_a1keylogger.zip?item=33649-3&affiliate=22260.
The spider.py is as follows:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(CrawlSpider):
    """ Crawl through web sites you specify """
    name = "softpedia"

    # Stay within these domains when crawling
    allowed_domains = ["www.softpedia.com"]

    start_urls = [
        "http://win.softpedia.com/",
    ]

    download_delay = 2

    # Add our callback which will be called for every found link
    rules = [
        Rule(SgmlLinkExtractor(), follow=True)
    ]
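For reference, my understanding is that a rule can also name a callback to be run on every followed page; a rough sketch of what I believe that would look like (using the parse_installer method from the PS below):

# inside MySpider, replacing the rules above
rules = [
    Rule(SgmlLinkExtractor(), callback='parse_installer', follow=True)
]

I am not sure whether this is the piece I am missing, though.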
items.py, pipelines.py and settings.py are left as the defaults, except for one line added to settings.py:
FILES_STORE = '/home/test/softpedia/downloads'
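As far as I understand, FILES_STORE is only consulted by the files pipeline, which I believe also has to be enabled in settings.py with something along these lines (the pipeline path is the one for my Scrapy version and may differ in others):

ITEM_PIPELINES = {
    # built-in pipeline that downloads every URL listed in an item's file_urls
    'scrapy.contrib.pipeline.files.FilesPipeline': 1,
}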
Using urllib2, I can tell whether a link points to an installer or not; in this case I get 'application' in the Content-Type:
>>> import urllib2
>>> url = 'http://hotdownloads.com/trialware/download/Download_a1keylogger.zip?item=33649-3&affiliate=22260'
>>> response = urllib2.urlopen(url)
>>> content_type = response.info().get('Content-Type')
>>> print content_type
application/zip
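If it helps, I imagine wrapping that check in a small helper so the spider could reuse it (the name is_installer is just my own, and the error handling is minimal):

import urllib2

def is_installer(url):
    """Rough check: treat any 'application/*' Content-Type as an installer link."""
    try:
        response = urllib2.urlopen(url, timeout=10)
        content_type = response.info().get('Content-Type') or ''
        return content_type.startswith('application')
    except urllib2.URLError:
        return False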
My question is: how do I gather the desired installer links and download them to my destination folder? Thanks in advance!
PS:
I have found two methods so far, but I cannot get either of them working:
1. I followed the answer at https://stackoverflow.com/a/7169241/2092480 by adding the following code to the spider:
    def parse_installer(self, response):
        # extract links
        lx = SgmlLinkExtractor()
        urls = lx.extract_links(response)
        for url in urls:
            yield Request(url, callback=self.save_installer)

    def save_installer(self, response):
        path = self.get_path(response.url)
        with open(path, "wb") as f:  # or using wget
            f.write(response.body)
The spider just runs as if this code did not exist, and I get no downloaded files. Can someone see where it went wrong?
2. https://groups.google.com/forum/print/msg/scrapy-users/kzGHFjXywuY/O6PIhoT3thsJ: this method itself works when I provide predefined links in ["file_urls"], but how do I get Scrapy to collect all the installer links into ["file_urls"] (a rough sketch of what I mean follows below)? Besides, I would guess that for such a simple task the first method should already be sufficient.
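What I mean by gathering the links into ["file_urls"] is roughly the following sketch (InstallerItem and the is_installer helper above are my own placeholders, and it assumes the files pipeline from that thread is enabled):

from scrapy.item import Item, Field
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class InstallerItem(Item):
    file_urls = Field()  # URLs the files pipeline should download
    files = Field()      # filled in by the pipeline with the download results

# inside MySpider
def parse_installer(self, response):
    lx = SgmlLinkExtractor()
    for link in lx.extract_links(response):
        if is_installer(link.url):  # the urllib2 helper sketched above
            item = InstallerItem()
            item["file_urls"] = [link.url]
            yield item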