
Currently, my spider can crawl endless links from softpedia.com, including the desired installer links, such as http://hotdownloads.com/trialware/download/Download_a1keylogger.zip?item=33649-3&affiliate=22260.

The spider.py is as follows:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(CrawlSpider):
    """ Crawl through web sites you specify """
    name = "softpedia"

    # Stay within these domains when crawling
    allowed_domains = ["www.softpedia.com"]

    start_urls = ["http://win.softpedia.com/"]

    download_delay = 2

    # Add our callback which will be called for every found link
    rules = [
            Rule(SgmlLinkExtractor(), follow=True)
    ]

items.py, pipelines.py and settings.py are left at their defaults, except for one line added to settings.py:

FILES_STORE = '/home/test/softpedia/downloads'
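
For the file download pipeline mentioned in the PS below, I believe the pipeline also has to be enabled in settings.py; a rough sketch of what I think that would look like (untested, using the scrapy.contrib path from my Scrapy version):

# Sketch only: enable the stock file download pipeline (old scrapy.contrib layout assumed)
ITEM_PIPELINES = ['scrapy.contrib.pipeline.files.FilesPipeline']
FILES_STORE = '/home/test/softpedia/downloads'  # where downloaded files should end up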

Using urllib2, I can tell whether a link is an installer or not; in this case I get 'application' in content_type:

>>> import urllib2
>>> url = 'http://hotdownloads.com/trialware/download/Download_a1keylogger.zip?item=33649-3&affiliate=22260'
>>> response = urllib2.urlopen(url)
>>> content_type = response.info().get('Content-Type')
>>> print content_type
application/zip
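
Wrapped up as a helper, the check I have in mind is roughly this (a hypothetical sketch, not something my spider calls yet):

import urllib2

def is_installer(url):
    # Hypothetical helper: treat any response whose Content-Type starts with
    # 'application' as an installer download, using the same urllib2 check as above.
    try:
        content_type = urllib2.urlopen(url).info().get('Content-Type', '')
    except urllib2.URLError:
        return False
    return content_type.startswith('application')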

My question is: how do I gather the desired installer links and download them to my destination folder? Thanks in advance!

PS:

I have found two methods so far, but I cannot get either of them to work:

1. https://stackoverflow.com/a/7169241/2092480: I followed this answer by adding the following code to the spider:

def parse_installer(self, response):
    # extract links
    lx = SgmlLinkExtractor()  
    urls = lx.extract_links(response)
    for url in urls:
        yield Request(url, callback=self.save_installer)

def save_installer(self, response):
    path = self.get_path(response.url)
    with open(path, "wb") as f: # or using wget
        f.write(response.body)
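
(get_path is not shown above; what I had in mind is a hypothetical helper that just maps the URL to a file under the download folder, roughly:)

import os
import urlparse

def get_path(self, url):
    # Hypothetical helper: name the file after the last path segment of the URL
    # and place it under the same folder as FILES_STORE.
    name = os.path.basename(urlparse.urlsplit(url).path) or 'downloaded.bin'
    return os.path.join('/home/test/softpedia/downloads', name)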

The spider just runs as if this code did not exist, and I get no downloaded files. Can anyone see what went wrong?

2. https://groups.google.com/forum/print/msg/scrapy-users/kzGHFjXywuY/O6PIhoT3thsJ: this method itself works when I provide pre-defined links in `file_urls`. But how do I get Scrapy to gather all the installer links into `file_urls`? Besides, I would guess that for such an easy task the first method should be sufficient.
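
For completeness, the items.py I would use for the pipeline route is along these lines; file_urls and files are the field names the stock FilesPipeline expects (sketch):

from scrapy.item import Item, Field

class SoftpediaItem(Item):
    # Fields read (file_urls) and written (files) by the stock FilesPipeline
    file_urls = Field()
    files = Field()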

  • Are you saving the fully qualified urls to the `file_urls` field? – R. Max Nov 05 '13 at 15:51
  • I want to do so, so I can use the FilesPipeline to download (the second method), but I haven't figured out how to collect the urls? – Deming Nov 05 '13 at 18:13

1 Answer


I combined the two methods mentioned to obtain the actual/mirror installer download links, then use the file download pipeline (FilesPipeline) to do the actual download. However, it does not seem to work if the file download URL is dynamic/complex, e.g. http://www.softpedia.com/dyn-postdownload.php?p=00000&t=0&i=1, but it works for simpler links, e.g. http://www.ietf.org/rfc/rfc2616.txt.

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from myscraper.items import SoftpediaItem

class SoftpediaSpider(CrawlSpider):
    name = "sosoftpedia"
    allowed_domains = ["www.softpedia.com"]
    start_urls = ['http://www.softpedia.com/get/Antivirus/']

    # Follow product pages under /get/ and hand each one to parse_links
    rules = (
        Rule(SgmlLinkExtractor(allow=('/get/',),
                               allow_domains=("www.softpedia.com",),
                               restrict_xpaths=("//td[@class='padding_tlr15px']",)),
             callback='parse_links', follow=True),
    )

    def parse_start_url(self, response):
        return self.parse_links(response)

    def parse_links(self, response):
        # Product download page: extract the download links and follow them
        print "PRODUCT DOWNLOAD PAGE: " + response.url
        hxs = HtmlXPathSelector(response)
        urls = hxs.select("//a[contains(@itemprop, 'downloadURL')]/@href").extract()
        for url in urls:
            item = SoftpediaItem()
            request = Request(url=url, callback=self.parse_downloaddetail)
            request.meta['item'] = item
            yield request

    def parse_downloaddetail(self, response):
        # Mirror/"actual download" page: put the real link(s) into file_urls
        # so the FilesPipeline fetches them, e.g. ["http://www.ietf.org/rfc/rfc2616.txt"]
        item = response.meta['item']
        hxs = HtmlXPathSelector(response)
        item["file_urls"] = hxs.select('//p[@class="fontsize16"]/b/a/@href').extract()
        print "ACTUAL DOWNLOAD LINKS " + item["file_urls"][0]
        yield item