Here's the situation: I have a large word list, and I want to crawl some data based on these words. The job is time-consuming, so I'd like to be able to split it into pieces and resume it later.
First, I load the words into a list in the __init__ of my spider:
def __init__(self, category=None, *args, **kwargs):
    super(GlosbeSpider, self).__init__(*args, **kwargs)
    self.word_list = []
    # one word per line in the list file
    with open('glosbe/vi/word4/word_list_4', 'r') as list_file:
        for line in list_file:
            self.word_list.append(line.strip())
    print 'INIT!!!!!'
Then I create the initial requests in start_requests():
def start_requests(self):
    container = []
    for word in self.word_list:
        url = "https://glosbe.com/gapi/tm?from=zh&dest=%s&format=json&phrase=%s&page=%d&pretty=true" % (
            self.language, word, 1)
        meta_info = {'page_num': 1, 'word': word}
        new_req = scrapy.Request(url, callback=self.parse_json, meta=meta_info,
                                 dont_filter=True, errback=self.process_error)
        container.append(new_req)
    print 'START_REQUESTS!!!!!'
    return container
And I parse the items in parse_json() (the code is omitted here; it isn't important for this question).
According to the official documentation, if I run the same command twice in the shell, e.g.:

scrapy crawl MySpider -s JOBDIR=dir_I_want_to_use

then the crawler should continue its work from where it stopped.
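(In case it matters, I assume the same thing can also be configured in settings.py instead of passing -s on the command line; the path below is just an example:)

# settings.py -- equivalent, I believe, to `-s JOBDIR=...` on the command line
JOBDIR = 'crawls/glosbe_run_1'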
However, when I resume the job with the same command, I still see

INIT!!!!!
START_REQUESTS!!!!!

on the screen. Why? I thought it would continue parsing from where it left off, without calling start_requests() again.

If I want to continue my crawling job from where it stopped, how should I deal with this? Thanks.
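For reference, from the "keeping persistent state between batches" part of the docs I understand that the spider gets a self.state dict which is saved into JOBDIR between runs. Below is a rough sketch of what I imagine doing with it (the 'done_words' key is my own invention), but I'm not sure this is the intended way to resume:

def start_requests(self):
    # Skip words that a previous, interrupted run already finished.
    done = self.state.get('done_words', set())
    for word in self.word_list:
        if word in done:
            continue
        # ... build and yield the request as above ...

def parse_json(self, response):
    # Mark this word as done; self.state should be written to JOBDIR on shutdown.
    self.state.setdefault('done_words', set()).add(response.meta['word'])
    # ... actual parsing omitted ...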