For work, I was tasked with pulling a ton of inventory data for a company that was acquired by my employer. I thought Scrapy would be a great tool for this, and so far I'm having fun. However, I am trying to use scrapyd and am running into a problem that might be an easy fix. I am using a virtual environment (env) and running the following:

  • Ubuntu 20.04
  • Python 3.8
  • Scrapy 2.7.1
  • scrapyd 1.3.0 (twistd 22.10.0)
  • scrapyd-client 1.2.2

I cloned a Scrapy project from GitHub with a spider that looked interesting and started playing around. I am able to run the spider using the command:

scrapy crawl examplespider

It works great and I love it. I then followed some tutorials to set up scrapyd, deploy my Scrapy project to it, and schedule runs, and this is where I run into my problem. I can confirm that the daemon is running:

(scrape) ubuntu@ip-172-26-13-235:~/env$ curl http://localhost:6800/daemonstatus.json
{"node_name": "ip-172-26-13-235", "status": "ok", "pending": 0, "running": 0, "finished": 11}

Verified that the project was deployed:

(scrape) ubuntu@ip-172-26-13-235:~/env$ curl http://localhost:6800/listprojects.json
{"node_name": "ip-172-26-13-235", "status": "ok", "projects": ["NebulaEmailScraper"]}

Checked the versions (there are a lot because I was troubleshooting):

(scrape) ubuntu@ip-172-26-13-235:~/env$ curl http://localhost:6800/listversions.json?project=NebulaEmailScraper
{"node_name": "ip-172-26-13-235", "status": "ok", "versions": ["1671003419", "1671086362", "1671182432", "1671183695", "1671183711", "1671183723", "1671183824", "1671184333", "1671184826", "1671184854", "1671186281", "1671187105", "1671260058", "1671260865", "1671261377", "1671261947"]}

Checked the spider:

(scrape) ubuntu@ip-172-26-13-235:~/env$ curl http://localhost:6800/listspiders.json?project=NebulaEmailScraper
{"node_name": "ip-172-26-13-235", "status": "ok", "spiders": ["googleemailspider"]}

Checked the jobs that I scheduled and ran:

(scrape) ubuntu@ip-172-26-13-235:~/env$ curl http://localhost:6800/listjobs.json?project=NebulaEmailScraper | python -m json.tool
{
    "node_name": "ip-172-26-13-235",
    "status": "ok",
    "pending": [],
    "running": [],
    "finished": [
        {
            "project": "NebulaEmailScraper",
            "spider": "googleemailspider",
            "id": "1ccfce2e7ddc11edbfc8f1be5e75662a",
            "start_time": "2022-12-17 07:26:25.477894",
            "end_time": "2022-12-17 07:26:26.445959"
        },
        ... I won't list all of them
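
The full log for any of these jobs can also be fetched straight from scrapyd, which by default serves them under /logs/<project>/<spider>/<job_id>.log, e.g.:

curl http://localhost:6800/logs/NebulaEmailScraper/googleemailspider/1ccfce2e7ddc11edbfc8f1be5e75662a.log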

Here is what the log files look like:

scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-12-17 06:57:41 [scrapy.middleware] INFO: Enabled item pipelines:
['nebulaemailscraper.pipelines.EmailsDetailsPipeline']
2022-12-17 06:57:41 [scrapy.core.engine] INFO: Spider opened
2022-12-17 06:57:41 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-12-17 06:57:41 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-12-17 06:57:41 [scrapy.core.engine] INFO: Closing spider (finished)
2022-12-17 06:57:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 0.002463,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 12, 17, 6, 57, 41, 386908),
 'log_count/DEBUG': 1,
 'log_count/INFO': 10,
 'log_count/WARNING': 1,
 'memusage/max': 63549440,
 'memusage/startup': 63549440,
 'start_time': datetime.datetime(2022, 12, 17, 6, 57, 41, 384445)}
2022-12-17 06:57:41 [scrapy.core.engine] INFO: Spider closed (finished)

Like I said, when I run scrapy crawl [spidername] directly it works great and reports crawling hundreds of pages. Under scrapyd, though, the spider opens and closes immediately with 0 pages crawled, as the log above shows.
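
One check I can try next (also suggested in the comments below) is logging from inside the spider to confirm that the deployed code actually generates start requests when run under scrapyd. A minimal sketch; the class name and URL here are placeholders, not the real spider's code:

import scrapy

class GoogleEmailSpider(scrapy.Spider):
    name = "googleemailspider"
    start_urls = ["https://example.com"]  # placeholder; the real spider uses its own URLs

    def start_requests(self):
        # This line should appear in the scrapyd job log if the deployed
        # code is running and start requests are actually being generated
        self.logger.info("start_requests called; start_urls=%s", self.start_urls)
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        self.logger.info("parsed %s", response.url)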

Here is my scrapy.cfg

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = nebulaemailscraper.settings

[deploy]
url = http://localhost:6800/
project = nebulaemailscraper
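
(For completeness, scrapyd-client also supports named deploy targets, which some tutorials use; a named variant of the section above would look like the following and be deployed with scrapyd-deploy local:)

[deploy:local]
url = http://localhost:6800/
project = nebulaemailscraper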

Here is my settings.py

BOT_NAME = "nebulaemailscraper"
SPIDER_MODULES = ["nebulaemailscraper.spiders"]
NEWSPIDER_MODULE = "nebulaemailscraper.spiders"


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 0.1

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    "nebulaemailscraper.middlewares.NebulaemailscraperSpiderMiddleware": 543,
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'nebulaemailscraper.middlewares.NebulaemailscraperDownloaderMiddleware': 543,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
EXTENSIONS = {
    'scrapy.extensions.telnet.TelnetConsole': None,
}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    "nebulaemailscraper.pipelines.EmailsDetailsPipeline": 300,
}


I have tried deleting the project and creating a new one, and it did not work. I am so tired at the moment that I need to lie down; I will update this question some more after I get a few hours of rest. Thanks everyone for the help. I'm sure it's a config file preventing this from working the way it's supposed to.

waltmagic
  • print some logs using `self.logger.info("I am here")` and then deploy to make sure your code is being deployed – Umair Ayub Dec 18 '22 at 05:30
  • Sorry, I have been busy with two jobs and have found it hard to find time. I did that and I did see the logs. At least I know I'm deploying correctly... – waltmagic Dec 28 '22 at 23:09

0 Answers