Questions tagged [scrapyd]

`Scrapyd` is a daemon for managing `Scrapy` projects. It used to be part of `scrapy` itself, but was split out and is now a standalone project. It runs on a machine and lets you deploy (i.e. upload) your projects and control the spiders they contain through a JSON web service.

Scrapyd can manage multiple projects and each project can have multiple versions uploaded, but only the latest one will be used for launching new spiders.
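
For orientation, a minimal sketch of talking to that JSON web service (assuming a default scrapyd instance on localhost:6800; the project and spider names are hypothetical):

```python
import requests

BASE = "http://localhost:6800"  # scrapyd's default address

# Check that the daemon is up.
print(requests.get(f"{BASE}/daemonstatus.json").json())

# Schedule a run; scrapyd launches the latest uploaded version of the project.
resp = requests.post(
    f"{BASE}/schedule.json",
    data={"project": "myproject", "spider": "myspider"},  # hypothetical names
)
print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}
```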

355 questions
1
vote
1 answer

scrapy_splash.SplashRequest doesn't execute callback function when scheduled by scrapyd

I encountered some strange behaviour (from my perspective) in SplashRequest's callback when it is executed by scrapyd. Scrapy source code: from scrapy.spiders import Spider from scrapy import Request import scrapy from scrapy_splash…
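
For reference, a minimal scrapy_splash spider sketch with the callback wired explicitly (hypothetical names; assumes a Splash instance configured through scrapy-splash's settings). Scrapyd ultimately launches the spider through Scrapy's own crawler process, so the callback wiring is the same as under `scrapy crawl`:

```python
import scrapy
from scrapy_splash import SplashRequest

class SplashExampleSpider(scrapy.Spider):
    name = "splash_example"  # hypothetical spider name
    start_urls = ["https://example.com"]

    def start_requests(self):
        for url in self.start_urls:
            # The callback must be a bound method of the spider.
            yield SplashRequest(url, callback=self.parse_result, args={"wait": 0.5})

    def parse_result(self, response):
        yield {"title": response.css("title::text").get()}
```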
1
vote
1 answer

Update spider code controlled by scrapyd

What is the proper way to install/activate a spider that is controlled by scrapyd? I install a new spider version using scrapyd-deploy; a job is currently running. Do I have to stop the job using cancel.json, then schedule a new job?
Markus
  • 2,412
  • 29
  • 28
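
One answer pattern, sketched against scrapyd's JSON API (default host; project, spider, and job id are hypothetical): upload the new version with scrapyd-deploy, cancel the old job if you don't want it to finish, then schedule again, since new jobs always use the latest uploaded version:

```python
import requests

BASE = "http://localhost:6800"

# Cancel the currently running job (the id comes from schedule.json
# or listjobs.json).
requests.post(f"{BASE}/cancel.json",
              data={"project": "myproject", "job": "old-job-id"})

# Schedule a fresh job; it picks up the newest deployed version.
requests.post(f"{BASE}/schedule.json",
              data={"project": "myproject", "spider": "myspider"})
```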
1
vote
3 answers

How to keep a .sh file always running

I'm new to shell scripting; I want the command to be running always. My .sh file, startscrapy.sh: #!/bin/bash echo "Scrapyd is started now" scrapyd. I have also changed the permissions: chmod +x etc/init.d/startscrapy.sh. I have placed this file…
Vimal Annamalai
  • 139
  • 1
  • 2
  • 12
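
A crude way to keep the daemon alive is a restart loop; a minimal watchdog sketch (a service manager such as systemd or supervisord is the more common answer):

```python
import subprocess
import time

# Restart scrapyd whenever it exits, with a short back-off.
while True:
    proc = subprocess.Popen(["scrapyd"])  # assumes scrapyd is on PATH
    proc.wait()
    time.sleep(5)
```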
1
vote
0 answers

Portia spider not crawling items

I have created a spider using the Portia UI, and I have deployed and scheduled it on one of my virtual machines using scrapyd. The spider ran fine and scraped the website's contents. But when I try to deploy and schedule the same spider on another similar virtual…
Prabhakar
  • 1,138
  • 2
  • 14
  • 30
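
When the same project behaves differently on a second machine, one quick check (hypothetical project name) is whether scrapyd there actually sees the spider in the deployed egg:

```python
import requests

# listspiders.json enumerates spiders in the project's latest deployed
# version; an empty list points at a deploy problem rather than a crawl one.
r = requests.get("http://localhost:6800/listspiders.json",
                 params={"project": "myproject"})
print(r.json())
```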
1
vote
2 answers

Securing scrapyd's APIs and Web Interface

I have set up Scrapyd to manage Scrapy spiders in a better way, and it is doing that really well. I am just unsure how to secure it, as I fear that anyone who learns this is a Scrapyd server can use the APIs to manipulate the working of…
harkirat1892
  • 453
  • 5
  • 19
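
Scrapyd itself historically shipped with no authentication, so a common approach is to bind it to localhost and front it with a reverse proxy that enforces HTTP Basic auth (recent scrapyd releases also accept username/password options in scrapyd.conf). A client-side sketch with hypothetical host and credentials:

```python
import requests
from requests.auth import HTTPBasicAuth

resp = requests.post(
    "https://scrapyd.example.com/schedule.json",  # proxy in front of :6800
    data={"project": "myproject", "spider": "myspider"},
    auth=HTTPBasicAuth("scrapyd-user", "s3cret"),  # hypothetical credentials
)
print(resp.json())
```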
1
vote
0 answers

Scrapyd on Heroku can't recognize rewritten DATABASE_URL by heroku-buildpack-pgbouncer

Okay, here is my setup: I'm on Heroku running a scrapyd daemon using the scrapy-heroku package (https://github.com/dmclain/scrapy-heroku). I'm having issues with running out of database connections. I decided to try pooling the database connections use…
jeffjv
  • 3,461
  • 2
  • 21
  • 28
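
Since the buildpack rewrite only exists in the dyno's runtime environment, a first step is to confirm what the scrapyd process actually sees; a small sketch:

```python
import os
from urllib.parse import urlparse

# If heroku-buildpack-pgbouncer rewrote DATABASE_URL, host and port here
# should point at the local pgbouncer, not at the Postgres add-on itself.
url = urlparse(os.environ["DATABASE_URL"])
print(url.hostname, url.port, url.path.lstrip("/"))
```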
1
vote
2 answers

scrapyd - error while running spiders simultaneously

I'm trying to run two Scrapy spiders simultaneously using scrapyd. I execute curl http://localhost:6800/schedule.json -d project=myproject -d spider=somespider Result: {"status": "ok", "jobid":…
Rainmaker
  • 10,294
  • 9
  • 54
  • 89
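
Scheduling several spiders is just repeated calls to schedule.json; scrapyd queues the jobs and runs up to its configured process limit in parallel. A sketch (the second spider name is hypothetical):

```python
import requests

for spider in ("somespider", "otherspider"):
    r = requests.post(
        "http://localhost:6800/schedule.json",
        data={"project": "myproject", "spider": spider},
    )
    print(r.json())  # {"status": "ok", "jobid": "..."}

# How many jobs run at once is governed by max_proc / max_proc_per_cpu
# in the [scrapyd] section of scrapyd.conf.
```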
1
vote
1 answer

Disable Scrapyd item storing in .jl feed

Question: I want to know how to disable item storing in Scrapyd. What I tried: I deployed a spider to the Scrapy daemon Scrapyd. The deployed spider stores the scraped data in a database, and that works fine. However, Scrapyd logs each scraped Scrapy…
Pullie
  • 2,685
  • 3
  • 25
  • 31
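
The per-job .jl feeds come from scrapyd's items_dir option; a sketch of the relevant scrapyd.conf fragment, assuming the stock configuration layout:

```ini
[scrapyd]
# An empty items_dir stops scrapyd from writing per-job .jl item feeds;
# items are then handled only by the project's own pipelines.
items_dir =
```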
1
vote
1 answer

scrapyd multiple spiders writing items to same file

I have a scrapyd server with several spiders running at the same time; I start the spiders one by one using the schedule.json endpoint. All spiders write their contents to a common file using a pipeline: class JsonWriterPipeline(object): def __init__(self,…
silvestrelosada
  • 55
  • 1
  • 10
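
Each scrapyd job is a separate process, so multiple spiders appending to one shared file can interleave output. A pipeline sketch that writes one file per spider instead (the file naming is illustrative):

```python
import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        # One .jl file per spider avoids cross-process interleaving.
        self.file = open(f"items-{spider.name}.jl", "a", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + "\n")
        return item
```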
1
vote
1 answer

Scrapy DEPTH_PRIORITY doesn't work

I would like my spider to crawl the start_urls websites entirely before following links deeper into the sites. The crawler's aim is to find expired domains. For example, I create a page with 500 URLs (450 expired & 50 active websites); the crawler must insert in…
Pixel
  • 900
  • 1
  • 13
  • 31
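
For breadth-first crawling, the usual recipe (per Scrapy's FAQ) swaps the scheduler's LIFO queues for FIFO ones alongside DEPTH_PRIORITY; a settings.py sketch:

```python
# Process shallower requests before deeper ones (breadth-first order).
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"
```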
1
vote
1 answer

Deploying egg file to scrapyd server returns {"status": "error", "message": "IndexError: list index out of range"}

Deploying to project "projectname" in http://127.0.0.1:6800/addversion.json Server response (200): {"status": "error", "message": "IndexError: list index out of range"} When I create an egg file and deploy it to the scrapyd server, this kind of error…
Pythonsguru
  • 424
  • 3
  • 11
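
One way to narrow this down is to upload the egg by hand instead of through scrapyd-deploy; if addversion.json still returns the IndexError, the egg itself is the problem. A sketch with hypothetical file and version values:

```python
import requests

with open("projectname-1.0-py3.egg", "rb") as egg:  # hypothetical egg path
    r = requests.post(
        "http://127.0.0.1:6800/addversion.json",
        data={"project": "projectname", "version": "1.0"},
        files={"egg": egg},
    )
print(r.json())
```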
1
vote
0 answers

How to deploy your scrapy spiders for long-term running

I'm building a scraper with the Scrapy framework in order to scrape a webshop. This webshop has several categories and subcategories. I have already finished the spider and it works like a charm. I currently use it via the start_urls = [] parameter of the spider (…
Andronaute
  • 379
  • 3
  • 12
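
Rather than editing start_urls for every run, note that any extra field posted to schedule.json is handed to the spider as an argument, so one deployed spider can be re-scheduled per category. A sketch with hypothetical names:

```python
import requests

# "category" arrives in the spider as self.category (a plain spider arg).
requests.post(
    "http://localhost:6800/schedule.json",
    data={"project": "webshop", "spider": "shop", "category": "shoes"},
)
```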
1
vote
2 answers

Fail to scrapyd-deploy

Traceback (most recent call last): File "/usr/local/bin/scrapyd-deploy", line 273, in main() File "/usr/local/bin/scrapyd-deploy", line 95, in main egg, tmpdir = _build_egg() File "/usr/local/bin/scrapyd-deploy", line 240, in…
wyuan
  • 11
  • 4
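
Since the failure is inside scrapyd-deploy's _build_egg(), building the egg directly (which, as far as I know, is what _build_egg does via the project's setup.py) often exposes the underlying error; a sketch:

```python
import subprocess
import sys

# Run from the project root containing scrapy.cfg and setup.py.
subprocess.check_call([sys.executable, "setup.py", "bdist_egg"])
```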
1
vote
1 answer

Install old version of scrapyd

I tried various ways to install an old version of scrapyd but did not succeed: sudo pip install scrapyd-0.24.6 sudo apt-get install scrapyd-0.24.6 Please tell me how I can download and install a specific version of scrapyd. Thanks
Pythonsguru
  • 424
  • 3
  • 11
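
pip pins a release with `==` rather than a hyphen. The shell one-liner is `pip install scrapyd==0.24.6`; the same thing driven from Python:

```python
import subprocess
import sys

# Equivalent to: pip install scrapyd==0.24.6
subprocess.check_call(
    [sys.executable, "-m", "pip", "install", "scrapyd==0.24.6"]
)
```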
1
vote
1 answer

Scrapyd vs Windows Task Scheduler

I want to run a small set of Scrapy spiders on an Azure virtual machine. I'm looking for an automation solution. For the time being it seems like Windows Task Scheduler will do the job for running 3-5 spiders on one VM instance. The only concern I…
Turo
  • 1,537
  • 2
  • 21
  • 42