I have created a new Scrapy spider that is extremely slow. It only scrapes around two pages per second, whereas the other Scrapy crawlers I have created have been crawling a lot faster.
I was wondering what could cause this issue and how I might fix it. The code is not very different from my other spiders, and I'm not sure whether it's related to the issue, but I'll add it if you think it may be involved.
In fact, I have the impression that the requests are not being made asynchronously. I have never run into this kind of problem before, and I am fairly new to Scrapy.
EDIT
Here's the spider:
    class DatamineSpider(scrapy.Spider):
        name = "Datamine"
        allowed_domains = ["domain.com"]
        start_urls = (
            'http://www.example.com/en/search/results/smth/smth/r101/m2108m',
        )

        def parse(self, response):
            for href in response.css('.searchListing_details .search_listing_title .searchListing_title a::attr("href")'):
                url = response.urljoin(href.extract())
                yield scrapy.Request(url, callback=self.parse_stuff)

            next_page = response.css('.pagination .next a::attr("href")')
            next_url = response.urljoin(next_page.extract()[0])
            yield scrapy.Request(next_url, callback=self.parse)

        def parse_stuff(self, response):
            item = Item()
            item['value'] = float(response.xpath('//*[text()="Price" and not(@class)]/../../div[2]/span/text()').extract()[0].split(' ')[1].replace(',', ''))
            item['size'] = float(response.xpath('//*[text()="Area" and not(@class)]/../../div[2]/text()').extract()[0].split(' ')[0].replace(',', '.'))
            try:
                item['yep'] = float(response.xpath('//*[text()="yep" and not(@class)]/../../div[2]/text()').extract()[0])
            except IndexError:
                print "NO YEP"
            else:
                yield item
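For reference, I haven't overridden any of the concurrency-related settings, so as far as I know my settings.py is effectively running with Scrapy's documented defaults. These are the values I understand to govern crawl throughput (copied from the Scrapy settings docs, not from my project):

    # settings.py -- throughput-related knobs, at Scrapy's default values
    CONCURRENT_REQUESTS = 16            # max simultaneous requests overall
    CONCURRENT_REQUESTS_PER_DOMAIN = 8  # max simultaneous requests to one domain
    DOWNLOAD_DELAY = 0                  # no artificial pause between requests
    AUTOTHROTTLE_ENABLED = False        # adaptive throttling is off by default

If something in my environment or a custom middleware were lowering these (or adding a download delay), I assume that could explain the ~2 pages/second rate.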