For work, I was tasked with pulling a ton of inventory data for a company that was acquired by the company I work for. I thought Scrapy would be a great tool for this, and so far I'm having fun. However, I'm trying to use scrapyd and am running into a problem that might be an easy fix. I'm using a virtual environment (env) and running the following:
- Ubuntu 20.04
- Python 3.8
- Scrapy 2.7.1
- scrapyd 1.3.0 (twistd 22.10.0)
- scrapyd-client 1.2.2
I cloned a Scrapy project from GitHub with a spider that looked interesting and started playing around. I am able to run the spider using the command:
scrapy crawl examplespider
It works great and I love it. I followed some tutorials to set up scrapyd, deploy my Scrapy project to it, and schedule runs (a rough example of the schedule call is below), and this is where I run into the problem. I can confirm that the daemon is running:
(scrape) ubuntu@ip-172-26-13-235:~/env$ curl http://localhost:6800/daemonstatus.json
{"node_name": "ip-172-26-13-235", "status": "ok", "pending": 0, "running": 0, "finished": 11}
Verified that the project was deployed:
(scrape) ubuntu@ip-172-26-13-235:~/env$ curl http://localhost:6800/listprojects.json
{"node_name": "ip-172-26-13-235", "status": "ok", "projects": ["NebulaEmailScraper"]}
Checked the versions (there are a lot because I was troubleshooting, trying to make it work):
(scrape) ubuntu@ip-172-26-13-235:~/env$ curl http://localhost:6800/listversions.json?project=NebulaEmailScraper
{"node_name": "ip-172-26-13-235", "status": "ok", "versions": ["1671003419", "1671086362", "1671182432", "1671183695", "1671183711", "1671183723", "1671183824", "1671184333", "1671184826", "1671184854", "1671186281", "1671187105", "1671260058", "1671260865", "1671261377", "1671261947"]}
Checked the spider:
(scrape) ubuntu@ip-172-26-13-235:~/env$ curl http://localhost:6800/listspiders.json?project=NebulaEmailScraper
{"node_name": "ip-172-26-13-235", "status": "ok", "spiders": ["googleemailspider"]}
Checked the jobs that I scheduled and ran:
(scrape) ubuntu@ip-172-26-13-235:~/env$ curl http://localhost:6800/listjobs.json?project=NebulaEmailScraper | python -m json.tool
{ "node_name": "ip-172-26-13-235",
"status": "ok", "pending": [],
"running": [], "finished": [
{
"project": "NebulaEmailScraper",
"spider": "googleemailspider",
"id": "1ccfce2e7ddc11edbfc8f1be5e75662a",
"start_time": "2022-12-17 07:26:25.477894",
"end_time": "2022-12-17 07:26:26.445959"
},
... I won't list all of them
Here is what the scrapyd job log files look like:
scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-12-17 06:57:41 [scrapy.middleware] INFO: Enabled item pipelines:
['nebulaemailscraper.pipelines.EmailsDetailsPipeline']
2022-12-17 06:57:41 [scrapy.core.engine] INFO: Spider opened
2022-12-17 06:57:41 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-12-17 06:57:41 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-12-17 06:57:41 [scrapy.core.engine] INFO: Closing spider (finished)
2022-12-17 06:57:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 0.002463,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 12, 17, 6, 57, 41, 386908),
'log_count/DEBUG': 1,
'log_count/INFO': 10,
'log_count/WARNING': 1,
'memusage/max': 63549440,
'memusage/startup': 63549440,
'start_time': datetime.datetime(2022, 12, 17, 6, 57, 41, 384445)}
2022-12-17 06:57:41 [scrapy.core.engine] INFO: Spider closed (finished)
Like I said, when I run scrapy crawl [spidername]
directly, it works great and reports crawling hundreds of pages; under scrapyd, as the log above shows, the spider opens and immediately closes with 0 pages crawled.
Here is my scrapy.cfg:
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html
[settings]
default = nebulaemailscraper.settings
[deploy]
url = http://localhost:6800/
project = nebulaemailscraper
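For reference, I deploy against this config with scrapyd-client, roughly like this (again from memory; I believe I passed the project name explicitly with -p, which would explain why listprojects shows NebulaEmailScraper even though scrapy.cfg says nebulaemailscraper):

scrapyd-deploy -p NebulaEmailScraper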
Here is my settings.py:
BOT_NAME = "nebulaemailscraper"
SPIDER_MODULES = ["nebulaemailscraper.spiders"]
NEWSPIDER_MODULE = "nebulaemailscraper.spiders"
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36"
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 0.1
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
"nebulaemailscraper.middlewares.NebulaemailscraperSpiderMiddleware": 543,
}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
'nebulaemailscraper.middlewares.NebulaemailscraperDownloaderMiddleware': 543,
}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
EXTENSIONS = {
'scrapy.extensions.telnet.TelnetConsole': None,
}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
"nebulaemailscraper.pipelines.EmailsDetailsPipeline": 300,
}
I have tried deleting the project and creating a new one (rough commands below), and it did not work. I am so tired at the moment that I need to lie down; I will update this question some more after I get a few hours of rest. Thanks, everyone, for the help. I'm sure it's a config file preventing this from working the way it's supposed to.
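For completeness, the delete-and-redeploy cycle I tried was roughly the following (again from memory, so the exact arguments may differ slightly):

curl http://localhost:6800/delproject.json -d project=NebulaEmailScraper
scrapyd-deploy -p NebulaEmailScraper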