
I am programming a web crawler in Python with Scrapy. The purpose is to monitor changes to a web page at pre-determined time intervals. After logging in to the website, the spider requests a web page every X minutes, and certain data is extracted from the page and saved to a text file. It turns out that the text file is written only when the spider closes, and the lines in the text file are not in chronological order. I can't figure out what is happening. Maybe it is specific to the way the Scrapy module works? Any ideas?

import scrapy
from scrapy.http import Request
from scrapy.http import FormRequest
from scraping_example.loginform import fill_login_form
from datetime import datetime
import time


class ExampleSpiderSpider(scrapy.Spider):
    name = 'example_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/login']
    login_user = 'edging780'
    login_pass = ''

    def parse(self, response):
        (args, url, method) = fill_login_form(response.url,
                response.body, self.login_user, self.login_pass)
        return FormRequest(url, method=method, formdata=args,
                           callback=self.after_login)

    def after_login(self, response):
        for i in range(6):
            request = Request('https://www.example.com/page_to_scrape',
                              callback=self.get_table, dont_filter=True)
            request.meta['dateTime'] = str(datetime.now())
            request.meta['order'] = str(i)
            yield request
            time.sleep(600)  # wait 10 minutes before yielding the next request

    def get_table(self, response):
        table = response.xpath('//table[@class="example_table"]/tbody/tr[not(contains(@class,"thead"))]')
        data = [row.xpath('td[1]/text()').extract() for row in table]

        dictionary = {'Time': response.meta['dateTime'],
                      'Order': response.meta['order'],
                      'Data': data}
        with open('output.txt', 'a') as f:  # append one line per response
            f.write(str(dictionary) + '\n')
edding780
  • Does using `after_login` as a generator work as intended? I could not find that in the docs: https://doc.scrapy.org/en/latest/topics/request-response.html, https://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.FormRequest, https://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-callback-arguments, https://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.FormRequest.from_response, https://stackoverflow.com/questions/5850755/using-scrapy-with-authenticated-logged-in-user-session, ... – serv-inc Nov 16 '17 at 20:01
  • I am not sure that I understand your question; it's probably because I started programming only a few months ago. If you are referring to the use of `yield` and `return` within the `after_login` method, I took it from some examples I found online. The code works, except that the lines in 'output.txt' are not in chronological order – edding780 Nov 16 '17 at 20:12
  • Could you link to such an example? (`yield` turns a function into a generator, whose `.next()` call provides the next value. This is different from `return`.) – serv-inc Nov 17 '17 at 12:02

1 Answer


You might want to read this: https://doc.scrapy.org/en/latest/faq.html#does-scrapy-crawl-in-breadth-first-or-depth-first-order

and this: LIFO (last in, first out), which is the order in which Scrapy's default scheduler queue pops pending requests.

Scrapy does not handle requests in the order you hand them over; by default, pending requests are popped in LIFO order (depth-first). You can change this behaviour with the settings described in the link above.

Also, you might want to consider using Items and the feed exporters to persist your scraped data, instead of writing the file by hand in the callback...
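
For example, here is a minimal sketch of that approach. The ScrapedRow item class and its field names are my own invention for illustration; FEED_URI and FEED_FORMAT are Scrapy's built-in feed export settings:

# items.py -- illustrative item class; the class and field names are hypothetical
import scrapy

class ScrapedRow(scrapy.Item):
    time = scrapy.Field()
    order = scrapy.Field()
    data = scrapy.Field()

The callback then yields items instead of writing to the file itself:

    # in the spider -- replaces the manual open()/write() in get_table
    def get_table(self, response):
        table = response.xpath('//table[@class="example_table"]/tbody/tr[not(contains(@class,"thead"))]')
        yield ScrapedRow(
            time=response.meta['dateTime'],
            order=response.meta['order'],
            data=[row.xpath('td[1]/text()').extract() for row in table],
        )

With FEED_URI = 'output.jl' and FEED_FORMAT = 'jsonlines' in settings.py, Scrapy writes one JSON object per line as each item is scraped, and you can swap the export format without touching the spider.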

Edit: On top of:

DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

I also needed to set

CONCURRENT_REQUESTS = 1

The latter setting makes Scrapy send the requests one at a time, so the responses come back in the order in which the requests were scheduled.
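
Putting the pieces together, a sketch of the relevant part of settings.py (the same values can also go in the spider's custom_settings dict):

# settings.py -- switch the scheduler from LIFO to FIFO (breadth-first) order
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

# issue one request at a time, so responses come back in schedule order
CONCURRENT_REQUESTS = 1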

Clément Denoix