
I have a large word list and want to crawl some data for each word. The crawl is time-consuming, so I'd like to split the job into pieces.

First, I load the words into a list in the __init__ of my spider.

def __init__(self, category=None, *args, **kwargs):
    super(GlosbeSpider, self).__init__(*args, **kwargs)
    # Load one word per line; the with-block closes the file even on error.
    self.word_list = []
    with open('glosbe/vi/word4/word_list_4', 'r') as list_file:
        for line in list_file:
            self.word_list.append(line.strip())
    print('INIT!!!!!')

Then I create some initial requests in start_requests():

def start_requests(self):
    container = []
    for word in self.word_list:
        # Request page 1 of the results for each word.
        url = "https://glosbe.com/gapi/tm?from=zh&dest=%s&format=json&phrase=%s&page=%d&pretty=true" % (
            self.language, word, 1)
        meta_info = {'page_num': 1, 'word': word}
        # dont_filter=True bypasses the duplicate filter for this request.
        new_req = scrapy.Request(url, callback=self.parse_json, meta=meta_info, dont_filter=True,
                                 errback=self.process_error)
        container.append(new_req)
    print('START_REQUESTS!!!!!')
    return container

I parse the items in parse_json() (the code is omitted here because it is not important).

According to the official documentation, if I run the same command twice in the shell, like:

scrapy crawl MySpider -s JOBDIR=dir_I_want_to_use

then the crawler will continue its work from where it stopped.

However, when I resume the job with the same command as above, I still see

INIT!!!!!
START_REQUESTS!!!!!

on the screen. Why? I thought it would continue parsing without calling start_requests() again.
How can I continue my crawling job from where it stopped? Thanks.

Pacific_73
  • I don't think there's anything wrong here. The spiders have to be initiated on resume. – Granitosaurus Aug 02 '17 at 19:29
  • I've gradually started to understand this process after a whole day's work. Thank you. It seems that I need to figure out a way to record the progress and avoid re-requesting the finished requests after resuming. – Pacific_73 Aug 03 '17 at 03:24
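
One way to record progress, as the last comment suggests, might look like the sketch below. It is only a minimal illustration, not Scrapy's own resume mechanism: the helper names load_finished, mark_finished and pending_words are hypothetical, as is the idea of keeping one finished word per line in a UTF-8 text file chosen by the caller.

```python
import os


def load_finished(path):
    """Return the set of words already recorded as finished in `path`."""
    if not os.path.exists(path):
        # First run: nothing has been finished yet.
        return set()
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def mark_finished(path, word):
    """Append one finished word to the progress file (hypothetical format)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(word + "\n")


def pending_words(word_list, finished):
    """Keep only the words that still need to be requested, in order."""
    return [w for w in word_list if w not in finished]
```

In the spider, start_requests() could then loop over pending_words(self.word_list, load_finished(path)) instead of the full list, and parse_json() could call mark_finished(path, word) once the last page for a word has been handled. Whether that fits depends on how parse_json() paginates, which is not shown in the question.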
