We run Scrapy 2.1.0 and scrapyd on Python 3.6 on Ubuntu 18.04, and I ran into a problem that I need help solving the right way. I'm new to Python (coming from other languages), so please speak slowly and loudly so I understand =)
The problem is that a job can get stuck as "pending" in the scrapyd schedule. The latest log messages, read with systemctl status scrapyd.service, show a deprecation warning:
usr/local/lib/python3.6/dist-packages/scrapy/utils/project.py:94: ScrapyDeprecationWarning: Use of environment variables prefixed with SCRAPY_ to override settings is deprecated. The following environment variables are currently defined: JOB, LOG_FILE, SLOT, SPIDER
After reading up on this issue, I think I understand the nature of the problem. My guess (please correct me here) is that scrapyd defines these environment variables at runtime, which triggers the warning in Scrapy, which in turn stops scrapyd from continuing.
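To make that guess concrete, here is a minimal sketch of the check as I understand it, runnable without Scrapy installed. The `simulated` dictionary and its values are made up for illustration, and `StandInDeprecationWarning` is a hypothetical stand-in for `ScrapyDeprecationWarning`. It also shows that, under Python's normal warning handling, `warnings.warn` reports the message and lets the code carry on, so the warning by itself may not be what blocks the job.

```python
import warnings


class StandInDeprecationWarning(Warning):
    """Hypothetical stand-in for scrapy.exceptions.ScrapyDeprecationWarning."""


# The names Scrapy 2.1.0 accepts after the SCRAPY_ prefix (copied from project.py).
VALID_ENVVARS = {
    'CHECK',
    'PICKLED_SETTINGS_TO_OVERRIDE',
    'PROJECT',
    'PYTHON_SHELL',
    'SETTINGS_MODULE',
}


def flagged_envvars(environ):
    """Return the sorted SCRAPY_-prefixed names (prefix stripped) that get reported."""
    stripped = {key[7:] for key in environ if key.startswith('SCRAPY_')}
    return sorted(stripped - VALID_ENVVARS)


def check(environ):
    """Warn about flagged names, then keep going like any code after a warning."""
    names = flagged_envvars(environ)
    if names:
        warnings.warn(
            'currently defined: {}'.format(', '.join(names)),
            StandInDeprecationWarning,
        )
    return 'still running'


# Assumed environment, mimicking what scrapyd appears to set for a job.
simulated = {
    'SCRAPY_JOB': 'abc123',
    'SCRAPY_LOG_FILE': 'logs/example.log',
    'SCRAPY_SLOT': '0',
    'SCRAPY_SPIDER': 'example',
    'SCRAPY_PROJECT': 'demo',
    'PATH': '/usr/bin',
}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    result = check(simulated)

print(flagged_envvars(simulated))
print(result, len(caught))
```

A warning only turns into an exception when a filter with the "error" action is active (for example `python -W error`), which I have not configured as far as I know.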
I can define LOG_FILE in the Scrapy project's settings.py, but the other three feel very specific to the scrapyd runtime/schedule and are not part of a Scrapy project settings file.
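For reference, setting the log file in settings.py would look roughly like this. The path is a hypothetical placeholder, not the one from my project.

```python
# settings.py (fragment)
# LOG_FILE is a regular Scrapy setting; the path below is only an example.
LOG_FILE = 'logs/example_spider.log'
```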
My ugly workaround, which seems to work, is to add these variables to the valid_envvars set in /scrapy/utils/project.py, around line 80:
scrapy_envvars = {k[7:]: v for k, v in os.environ.items() if
                  k.startswith('SCRAPY_')}
valid_envvars = {
    'CHECK',
    'PICKLED_SETTINGS_TO_OVERRIDE',
    'PROJECT',
    'PYTHON_SHELL',
    'SETTINGS_MODULE',
    'JOB',       # <<--- here...
    'LOG_FILE',  # <<--- here...
    'SLOT',      # <<--- here...
    'SPIDER',    # <<--- and here
}
setting_envvars = {k for k in scrapy_envvars if k not in valid_envvars}
if setting_envvars:
    setting_envvar_list = ', '.join(sorted(setting_envvars))
    warnings.warn(
        'Use of environment variables prefixed with SCRAPY_ to override '
        'settings is deprecated. The following environment variables are '
        'currently defined: {}'.format(setting_envvar_list),
        ScrapyDeprecationWarning
    )
This makes the deprecation warning go away, and the scrapyd schedule can move jobs from pending to running and then to finished.
Obviously this is a very bad idea, since I'm changing code inside a lib/module/package (or whatever the correct term is) and the change would be overwritten by any update from the package manager.
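For comparison, a less invasive idea would be to hide the message with a standard-library warnings filter instead of editing Scrapy. The sketch below only demonstrates the filter mechanism with plain `Warning` objects. `MESSAGE_PREFIX` and `shown_messages` are names I made up, and I have not verified where such a filter would need to live in a scrapyd setup or whether hiding the warning changes the "pending" behavior at all.

```python
import warnings

# Regex matched against the start of the warning text seen in the log.
MESSAGE_PREFIX = r'Use of environment variables prefixed with SCRAPY_'


def shown_messages(messages):
    """Emit each message as a warning and return the ones that were not filtered."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')  # show every warning...
        # ...except those matching the prefix; later filters take precedence.
        warnings.filterwarnings('ignore', message=MESSAGE_PREFIX)
        for text in messages:
            warnings.warn(text, Warning)
        return [str(item.message) for item in caught]


shown = shown_messages([
    'Use of environment variables prefixed with SCRAPY_ to override '
    'settings is deprecated.',
    'some unrelated warning',
])
print(shown)
```

In a real process the `warnings.filterwarnings` call would have to run before Scrapy loads its settings, and it only silences the message; the environment variables are still set.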
So my question is, assuming I have understood the source of the problem correctly, what is the correct way to fix this without changing code inside the scrapy core?
Thanks in advance for your input!