Questions tagged [scrapyd]

`Scrapyd` is a daemon for managing `Scrapy` projects. It used to be part of `scrapy` itself, but was split out and is now a standalone project. It runs on a machine and lets you deploy (i.e. upload) your projects and control the spiders they contain through a JSON web service.

Scrapyd can manage multiple projects and each project can have multiple versions uploaded, but only the latest one will be used for launching new spiders.
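
For example, scheduling a spider run and checking job status through that JSON API might look like the following sketch, assuming a Scrapyd instance on `localhost:6800` and a hypothetical project/spider named `myproject`/`myspider`:

```python
import requests

SCRAPYD = "http://localhost:6800"  # assumed local Scrapyd instance

# Schedule a run of the (hypothetical) spider "myspider" in project "myproject".
resp = requests.post(f"{SCRAPYD}/schedule.json",
                     data={"project": "myproject", "spider": "myspider"})
print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}

# List pending/running/finished jobs for the same project.
jobs = requests.get(f"{SCRAPYD}/listjobs.json", params={"project": "myproject"})
print(jobs.json())
```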

355 questions
1 vote · 0 answers

Scrapyd dies after a specified time when deployed on a server

I have a spider deployed successfully on localhost, running day after day without any trouble. When deployed to my Ubuntu server, the process started and the spider ran, but after a short time my scrapyd process stopped without giving me a clue. I've…
xuke • 45 • 7
1 vote · 1 answer

Dynamic DEPTH_LIMIT as parameter in Scrapy, passed from Scrapyd

I am currently using Scrapyd to start a crawling spider and the DEPTH_LIMIT setting is set in the Scrapy App settings. I was wondering how to pass the depth_limit as a parameter in Scrapyd, allowing me to set it "dynamically" as requested by the…
Nicolò Gasparini • 2,228 • 2 • 24 • 53
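
For reference, Scrapyd's `schedule.json` endpoint accepts a `setting` argument that overrides a Scrapy setting for that single run, which is the usual way to make `DEPTH_LIMIT` dynamic. A minimal sketch with hypothetical project and spider names:

```python
import requests

# Override DEPTH_LIMIT for this run only via schedule.json's "setting" argument.
requests.post("http://localhost:6800/schedule.json", data={
    "project": "myproject",   # hypothetical project name
    "spider": "myspider",     # hypothetical spider name
    "setting": "DEPTH_LIMIT=3",
})
```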
1 vote · 1 answer

Scrapyd jobs not starting

I integrated Scrapy in my Django project following this guide. Unfortunately, whichever way I try, the spider jobs are not starting, even though schedule.json gives me a jobid in return. My views: @csrf_exempt @api_view(['POST']) def crawl_url(request): …
Nicolò Gasparini • 2,228 • 2 • 24 • 53
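
A rough sketch of such a Django REST Framework view forwarding to Scrapyd is shown below; the view name matches the excerpt, but the project, spider, and payload handling are assumptions, and a `jobid` in the reply only means the job was queued, not that the spider actually ran:

```python
import requests
from django.views.decorators.csrf import csrf_exempt
from rest_framework.decorators import api_view
from rest_framework.response import Response


@csrf_exempt
@api_view(["POST"])
def crawl_url(request):
    # Forward the requested URL to Scrapyd's schedule.json; extra fields such as
    # "url" are passed through to the spider as spider arguments.
    payload = {
        "project": "myproject",          # hypothetical project name
        "spider": "myspider",            # hypothetical spider name
        "url": request.data.get("url"),
    }
    resp = requests.post("http://localhost:6800/schedule.json", data=payload)
    return Response(resp.json())
```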
1 vote · 1 answer

How to skip Parent directories while scraping a File Type Website?

While scraping a basic folder-system website that uses directories to store files, yield scrapy.Request(url1, callback=self.parse) follows the links and scrapes all the content of the crawled link, but I usually run into the…
1 vote · 1 answer

Get the response if the site wasn't crawled due to robots.txt

I'm trying to crawl user-defined websites but am not able to crawl sites where robots.txt prevents crawling. That's fine, but I want to get a response I can show to the user, e.g. "the site you have entered doesn't allow to crawl due to…
Dhaval • 901 • 3 • 8 • 26
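
One common way to surface this to the user (an assumption here, not necessarily the accepted answer) relies on the fact that Scrapy's `RobotsTxtMiddleware` raises `IgnoreRequest`, which routes the request to its errback, where the spider can record a user-facing message:

```python
import scrapy
from scrapy.exceptions import IgnoreRequest


class UserSiteSpider(scrapy.Spider):
    name = "user_site"  # hypothetical spider name
    custom_settings = {"ROBOTSTXT_OBEY": True}

    def start_requests(self):
        # In the real project the URL would come from the user.
        yield scrapy.Request("https://example.com/", callback=self.parse,
                             errback=self.on_error)

    def parse(self, response):
        yield {"url": response.url, "status": response.status}

    def on_error(self, failure):
        if failure.check(IgnoreRequest):
            # RobotsTxtMiddleware dropped the request: robots.txt forbids crawling.
            self.logger.warning("Site disallows crawling via robots.txt: %s", failure.value)
```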
1 vote · 0 answers

Modules folder in Scrapinghub

I'm currently using Scrapinghub's Scrapy Cloud to host my 12 spiders (and 12 different projects). I'd like to have one folder with functions that are used by all 12 spiders, but I'm not sure of the best way to implement it without having 1 functions…
Axel Eriksson • 105 • 1 • 11
1 vote · 0 answers

Speed up scrapy spiders initialisation time

I have multiple Scrapy spiders that I need to run at the same time every 5 minutes. The issue is that they take almost 30 seconds to 1 minute to start. It seems that they all start their own Twisted engine, and so it takes a lot of time. I've looked into…
fast_cen • 1,297 • 3 • 11 • 28
1 vote · 1 answer

How to set max_proc_per_cpu in Scrapyd

I have the following two Scrapy projects with the following configurations. Project1's scrapy.cfg: [settings] default = Project1.settings [deploy] url = http://localhost:6800/ project = Project1 [scrapyd] eggs_dir = eggs logs_dir =…
Yuseferi • 7,931 • 11 • 67 • 103
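
For context, `max_proc_per_cpu` (and `max_proc`) are options of the Scrapyd daemon itself, read from its own configuration file (for example `/etc/scrapyd/scrapyd.conf` or a `scrapyd.conf` next to where the daemon is started), not from the project's Scrapy settings. A minimal sketch of that section:

```ini
[scrapyd]
# Cap concurrent Scrapy processes per CPU (Scrapyd's default is 4).
max_proc_per_cpu = 2
# Absolute cap on processes; 0 means no fixed cap (max_proc_per_cpu * CPUs is used).
max_proc = 0
```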
1 vote · 0 answers

2 RabbitMQ workers and 2 Scrapyd daemons running on 2 local Ubuntu instances, in which one of the RabbitMQ workers is not working

I am currently working on building a "Scrapy spiders control panel", in which I am testing the existing solution available at [Distributed Multi-user Scrapy Spiders Control Panel]…
1 vote · 0 answers

Scrapyd, Celery and Django running with Supervisor - GenericHTTPChannellProtocol Error

I'm using a project called Django Dynamic Scraper to build a basic web scraper on top of Django. Everything works fine in development, but when setting up on my Digital Ocean VPS I run into issues. I'm using Supervisor to keep three things…
Dean Sherwin • 478 • 5 • 13
1 vote · 0 answers

How do I add the same scrapy pipeline to any spider in scrapyd

I have several projects running in scrapyd and they all use the same pipeline, so how can I add this pipeline to every scheduled spider by default, without adding anything to the curl request, only having a flag in the default_scrapyd.conf file?
Jgaldos • 540 • 1 • 5 • 9
1 vote · 0 answers

Why does a Scrapyd-scheduled spider encounter 503 when trying to scrape a site?

I am learning about Python and scraping and wrote my first spider using Scrapy. It works fine when I run it locally to scrape my test site. I deployed the project to Scrapyd on my remote server, but when I schedule the spider to run…
Dark Star1 • 6,986 • 16 • 73 • 121
1 vote · 1 answer

FEED_EXPORT_ENCODING option not working for Items files in Scrapyd - Python Scrapy

I am scraping a Chinese website. I have FEED_EXPORT_ENCODING='utf-8' in my settings.py file. If I run my scraper via scrapy crawl myscraper -o output.json, then my output file shows correct Chinese. But if I start my scraper via Scrapyd, then the Items…
Umair Ayub • 19,358 • 14 • 72 • 146
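
One workaround often suggested for this (an assumption here, not necessarily the accepted answer) is to force the encoding per run through `schedule.json`'s `setting` argument, in addition to keeping `FEED_EXPORT_ENCODING = 'utf-8'` in settings.py:

```python
import requests

# settings.py already has FEED_EXPORT_ENCODING = "utf-8"; this also forces the
# setting for the Scrapyd-launched run via schedule.json's "setting" argument.
requests.post("http://localhost:6800/schedule.json", data={
    "project": "myproject",   # hypothetical project name
    "spider": "myscraper",    # spider name taken from the excerpt
    "setting": "FEED_EXPORT_ENCODING=utf-8",
})
```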
1 vote · 1 answer

How to install Crawlera via setuptools

I want to install Crawlera with setuptools in Docker. In my scrapy.cfg file I have: [deploy=test] url = http://localhost:6800/ project = Crawling. I test with scrapyd-deploy -l and I get: test http://localhost:6800/. In my setup.py I…
parik • 2,313 • 12 • 39 • 67
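
A minimal `setup.py` sketch for declaring the Crawlera middleware (the `scrapy-crawlera` package on PyPI) as a dependency of the egg that `scrapyd-deploy` builds; the project and settings module names are taken from the excerpt's `project = Crawling` and are otherwise assumptions, and the dependency still has to be available in the environment Scrapyd runs in:

```python
from setuptools import setup, find_packages

setup(
    name="Crawling",          # project name from the excerpt's scrapy.cfg
    version="1.0",
    packages=find_packages(),
    # scrapyd-deploy uses this entry point to locate the project settings.
    entry_points={"scrapy": ["settings = Crawling.settings"]},
    install_requires=[
        "scrapy-crawlera",    # Crawlera (Smart Proxy Manager) middleware
    ],
)
```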
1 vote · 0 answers

scrapyd service and periodic scraping in virtualenv

The first time I installed scrapyd on Ubuntu 14.04, I didn't use the generic way. Using apt-get, my scrapyd was set up as a service that can be started and has its (log/config/dbs...) dependencies; however, the scrapy version was very outdated. So I…
user2243952 • 277 • 3 • 6 • 12