I have two different Scrapy spiders that currently work fine when launched with:
scrapy crawl spidername -o data\whatever.json
Of course, I know I could use a system call from the script to replicate that exact command, but I would rather stick with CrawlerProcess
or some other way of running the spiders from a script.
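For reference, the system-call approach I would rather avoid would look roughly like this (just a sketch; the spider names and output paths are placeholders):

import subprocess

# Sketch of the approach I'd prefer not to use: shelling out to the scrapy CLI
# once per spider, each run with its own output file.
subprocess.run(['scrapy', 'crawl', 'spidername', '-o', r'data\whatever.json'], check=True)
subprocess.run(['scrapy', 'crawl', 'otherspidername', '-o', r'data\whatever_else.json'], check=True)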
The thing is that, as explained in this SO question (and also in the Scrapy docs), I have to set the output file in the settings passed to the CrawlerProcess
constructor:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'FEED_FORMAT': 'json',
    'FEED_URI': 'data.json'
})
The problem is that I don't want both spiders to store their data in the same output file; I want two different files. So my first attempt was, obviously, to create a new CrawlerProcess
with different settings once the first job is done:
session_date_format = '%Y%m%d'
session_date = datetime.now().strftime(session_date_format)

try:
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'FEED_FORMAT': 'json',
        'FEED_URI': os.path.join('data', 'an_origin', '{}.json'.format(session_date)),
        'DOWNLOAD_DELAY': 3,
        'LOG_STDOUT': True,
        'LOG_FILE': 'scrapy_log.txt',
        'ROBOTSTXT_OBEY': False,
        'RETRY_ENABLED': True,
        'RETRY_HTTP_CODES': [500, 503, 504, 400, 404, 408],
        'RETRY_TIMES': 5
    })
    process.crawl(MyFirstSpider)
    process.start()  # the script will block here until the crawling is finished
except Exception as e:
    print('ERROR while crawling: {}'.format(e))
else:
    print('Data successfully crawled')

time.sleep(3)  # Wait 3 seconds

try:
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'FEED_FORMAT': 'json',
        'FEED_URI': os.path.join('data', 'other_origin', '{}.json'.format(session_date)),
        'DOWNLOAD_DELAY': 3,
        'LOG_STDOUT': True,
        'LOG_FILE': 'scrapy_log.txt',
        'ROBOTSTXT_OBEY': False,
        'RETRY_ENABLED': True,
        'RETRY_HTTP_CODES': [500, 503, 504, 400, 404, 408],
        'RETRY_TIMES': 5
    })
    process.crawl(MyOtherSpider)
    process.start()  # the script will block here until the crawling is finished
except Exception as e:
    print('ERROR while crawling: {}'.format(e))
else:
    print('Data successfully crawled')
When I do this, the first crawler works as expected. But then the second one creates an empty output file and fails. This also happens if I store the second CrawlerProcess
in a different variable, such as process2
. I tried swapping the order of the spiders to check whether the problem was specific to one of them, but the one that fails is always whichever runs second.
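For clarity, the only change in that variant is the second block, which becomes something like this (same settings dict as above, elided here):

process2 = CrawlerProcess({
    # ... same settings as above, only with the 'other_origin' FEED_URI ...
})
process2.crawl(MyOtherSpider)
process2.start()  # the failure shows up when this second run is started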
If I inspect the log file, it seems that after the first job is done, two Scrapy bots are started, so maybe something weird is happening:
2017-05-29 23:51:41 [scrapy.extensions.feedexport] INFO: Stored json feed (2284 items) in: data\one_origin\20170529.json
2017-05-29 23:51:41 [scrapy.core.engine] INFO: Spider closed (finished)
2017-05-29 23:51:41 [stdout] INFO: Data successfully crawled
2017-05-29 23:51:44 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: scrapybot)
2017-05-29 23:51:44 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: scrapybot)
2017-05-29 23:51:44 [scrapy.utils.log] INFO: Overridden settings: {'LOG_FILE': 'scrapy_output.txt', 'FEED_FORMAT': 'json', 'FEED_URI': 'data\\other_origin\\20170529.json', 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)', 'LOG_STDOUT': True, 'RETRY_TIMES': 5, 'RETRY_HTTP_CODES': [500, 503, 504, 400, 404, 408], 'DOWNLOAD_DELAY': 3}
2017-05-29 23:51:44 [scrapy.utils.log] INFO: Overridden settings: {'LOG_FILE': 'scrapy_output.txt', 'FEED_FORMAT': 'json', 'FEED_URI': 'data\\other_origin\\20170529.json', 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)', 'LOG_STDOUT': True, 'RETRY_TIMES': 5, 'RETRY_HTTP_CODES': [500, 503, 504, 400, 404, 408], 'DOWNLOAD_DELAY': 3}
...
2017-05-29 23:51:44 [scrapy.core.engine] INFO: Spider opened
2017-05-29 23:51:44 [scrapy.core.engine] INFO: Spider opened
2017-05-29 23:51:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-29 23:51:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-29 23:51:44 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-05-29 23:51:44 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-05-29 23:51:44 [stdout] INFO: ERROR while crawling:
2017-05-29 23:51:44 [stdout] INFO: ERROR while crawling:
Any idea of what's happening and how to fix this?