I am trying to run a Scrapy crawler through scrapyd with a JOBDIR so that the crawl state persists across runs. I have a script that sends the POST request to the scrapyd server:

scrapyd_script.py:

import requests
import json
import logging
from datetime import datetime

logging.basicConfig(
    filename="scrapyd_script.log",
    format="%(asctime)s %(message)s",
    filemode="w",
)

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def start_job():
    # Schedule the spider via scrapyd's schedule.json API; "setting"
    # passes JOBDIR through to Scrapy so the crawl state (request
    # queue, dupefilter) is persisted on disk.
    payload = {
        "project": "default",
        "spider": "houzz_crawler",
        "setting": "JOBDIR=houzz_crawler",
    }
    response = requests.post("http://localhost:6800/schedule.json", data=payload)
    return json.loads(response.text)


if __name__ == "__main__":
    job_data = start_job()
    logger.info(job_data)
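
The job does get scheduled: scrapyd accepts the request and returns a jobid. To verify this after a reboot I can query scrapyd's listjobs.json endpoint with a quick snippet like this (same project name as in the payload above):

import requests

# Ask scrapyd which jobs are pending, running, or finished for the project.
response = requests.get(
    "http://localhost:6800/listjobs.json",
    params={"project": "default"},
)
print(response.json())  # has 'pending', 'running' and 'finished' lists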

I then created a systemd service to run scrapyd_script.py on reboot.

scrapyd_script.service:

[Unit]
Description=My Lovely Service
After=network.target

[Service]
Type=idle
Restart=on-failure
User=root
ExecStart=/bin/bash -c 'cd /home/..../houzz/ && source venv/bin/activate && python /home/..../houzz/houzz_crawler/scrapyd_script.py'

[Install]
WantedBy=multi-user.target

The service starts on reboot, but the problem is that after every reboot the crawler starts from scratch instead of resuming where it left off. How can I resume the crawler from its previous state after a system reboot?
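
One thing I have already considered is timing: after a reboot the script could send the POST before scrapyd is listening on port 6800. A small wait loop along these lines would rule that out (just a sketch; daemonstatus.json is scrapyd's status endpoint, and the retry/timeout values are arbitrary):

import time
import requests

def wait_for_scrapyd(url="http://localhost:6800/daemonstatus.json", retries=30):
    # Poll scrapyd's status endpoint until it answers, so the
    # schedule request is not sent before the daemon is up.
    for _ in range(retries):
        try:
            if requests.get(url, timeout=2).ok:
                return True
        except requests.exceptions.RequestException:
            pass
        time.sleep(2)
    return False

Even assuming timing is not the issue, I still need the JOBDIR state to survive the reboot.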
