Questions tagged [scrapyd]

`Scrapyd` is a daemon for managing `Scrapy` projects. It was originally part of `scrapy` itself, but has since been split out into a standalone project. It runs on a machine and lets you deploy (i.e. upload) your projects and control the spiders they contain through a JSON web service.

Scrapyd can manage multiple projects and each project can have multiple versions uploaded, but only the latest one will be used for launching new spiders.
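Everything below goes through that JSON API; for example, scheduling a run is a single POST to `schedule.json`. A minimal standard-library sketch (the project and spider names are placeholders):

```python
import json
import urllib.parse
import urllib.request

SCRAPYD_URL = "http://localhost:6800"  # scrapyd's default port

def build_schedule_body(project, spider, **spider_args):
    """Encode the form body that schedule.json expects; extra keyword
    arguments are passed through to the spider as arguments."""
    return urllib.parse.urlencode({"project": project, "spider": spider, **spider_args})

def schedule_spider(project, spider, **spider_args):
    """POST to schedule.json; scrapyd replies with a status and a job id."""
    body = build_schedule_body(project, spider, **spider_args).encode()
    with urllib.request.urlopen(f"{SCRAPYD_URL}/schedule.json", data=body) as resp:
        return json.load(resp)
```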

355 questions
1 vote · 1 answer

Custom JSON Response from Scrapy Spider Deployed via Scrapyd

I need to find a way to make my Scrapy spider return a custom JSON response. It is deployed via scrapyd using schedule.json. Schedule.json responds with JobID and Status, but I'd like to add some more data to that response. If there's a way I could…
ChristianTL
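Scrapyd's own `schedule.json` reply is fixed to status/jobid fields, so short of writing a custom webservice resource, the usual workaround is to enrich the reply client-side. A sketch (the wrapper and its extra fields are hypothetical, not part of scrapyd):

```python
import json
import urllib.parse
import urllib.request

def annotate_reply(reply, extra):
    """Merge custom fields into scrapyd's reply; scrapyd's own keys win on conflict."""
    return {**extra, **reply}

def schedule_and_annotate(base_url, project, spider, extra):
    """Schedule a spider via schedule.json, then bolt caller-supplied fields onto the reply."""
    body = urllib.parse.urlencode({"project": project, "spider": spider}).encode()
    with urllib.request.urlopen(f"{base_url}/schedule.json", data=body) as resp:
        return annotate_reply(json.load(resp), extra)
```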
1 vote · 0 answers

Scrapyd S3 feed export "Connection Reset by Peer"

I'm running Scrapyd with a FEED_URI set to export to S3, but I received the following error at the very end of my scrape. Note that it successfully uploaded a few hundred kb of data to the bucket as the scrape began, then threw this error at the…
szxk
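For context, an S3 feed export of this era is configured with the `FEED_URI`/`FEED_FORMAT` settings plus AWS credentials, and the S3 storage backend buffers the whole feed locally and uploads it once when the crawl closes, which is why errors can surface only at the very end. A fragment with placeholder bucket and keys:

```python
# settings.py fragment -- bucket name and credentials are placeholders
FEED_URI = "s3://my-bucket/%(name)s-%(time)s.jl"
FEED_FORMAT = "jsonlines"
AWS_ACCESS_KEY_ID = "AKIA..."
AWS_SECRET_ACCESS_KEY = "..."
```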
1 vote · 2 answers

scrapyd: curl error `unknown or corrupt egg`

I'm trying to update the version of my spider. I ran: curl http://localhost:6800/addversion.json -d project=comicvn -d spider=comicvn2 -d version=141667324 -d egg=14116674324.egg It returned the error: {"status": "error", "message": "ValueError: Unknown or…
tuancoi
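For reference, `addversion.json` expects the egg as an uploaded file (curl's `-F egg=@file` syntax), not a plain `-d` field, and scrapyd-client's `scrapyd-deploy` can build and upload the egg in one step. A sketch using the question's names:

```
# upload an existing egg: the egg must be sent as a file (-F), and
# addversion.json takes project/version/egg (there is no spider parameter)
curl http://localhost:6800/addversion.json -F project=comicvn \
     -F version=141667324 -F egg=@14116674324.egg

# or build and upload in one step with scrapyd-client:
scrapyd-deploy scrapyd -p comicvn --version 141667324
```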
1 vote · 2 answers

Projects were not shown in scrapyd

I am new to scrapyd. I have inserted the code below into my scrapy.cfg file. [settings] default = uk.settings [deploy:scrapyd] url = http://localhost:6800/ project = ukmall [deploy:scrapyd2] url = http://scrapyd.mydomain.com/api/scrapyd/ username =…
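For comparison, a well-formed scrapy.cfg with one local and one remote target looks like this (the credential values are placeholders); note that a project only appears in scrapyd after it has actually been deployed to it, e.g. with `scrapyd-deploy scrapyd`, not merely by being listed in scrapy.cfg:

```ini
[settings]
default = uk.settings

[deploy:scrapyd]
url = http://localhost:6800/
project = ukmall

[deploy:scrapyd2]
url = http://scrapyd.mydomain.com/api/scrapyd/
project = ukmall
username = myuser
password = mypassword
```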
1 vote · 2 answers

Scrapyd: Writing CSV file to remote server

I'm trying to schedule a crawler on EC2 and have the output exported to a CSV file, cppages-nov.csv, while creating a jobdir in case I need to pause the crawl, but it is not creating any files. Am I using the correct feed exports? curl…
Jason Youk
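Per-run settings such as the feed URI and JOBDIR can be passed to `schedule.json` as repeated `setting=KEY=value` form fields; a standard-library sketch of building such a body (the bucket, project, and spider names are placeholders):

```python
import urllib.parse

def schedule_body_with_settings(project, spider, settings):
    """Encode a schedule.json body with repeated setting=KEY=value fields."""
    pairs = [("project", project), ("spider", spider)]
    pairs += [("setting", f"{key}={value}") for key, value in settings.items()]
    return urllib.parse.urlencode(pairs)

body = schedule_body_with_settings(
    "myproject",
    "pagespider",
    {"FEED_URI": "s3://my-bucket/cppages-nov.csv",
     "FEED_FORMAT": "csv",
     "JOBDIR": "crawls/pagespider-1"},
)
# POST `body` to http://localhost:6800/schedule.json
```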
1 vote · 1 answer

Restricting access to port 6800

I've recently set up my first Ubuntu server and installed scrapy and scrapyd. I've written a few spiders, and I've figured out how to execute the spiders through the API on port 6800. I also noticed there's a web interface there. I've also noticed…
Chad Casey
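Scrapyd itself ships no authentication, so the usual fixes are binding it to localhost in scrapyd.conf and reaching it through an SSH tunnel or an authenticating reverse proxy, and/or firewalling port 6800. For example:

```ini
# scrapyd.conf -- only reachable from the machine itself
[scrapyd]
bind_address = 127.0.0.1
http_port = 6800
```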
1 vote · 1 answer

Scrapyd Permission Denied on Deploy

I'm very new to Scrapyd, and am trying to deploy. I am running on Ubuntu 12.04 and installed the ubuntu version of Scrapyd. When I run scrapy deploy default -p pull_scrapers it returns Packing version 1407616523 Deploying to project "pull_scrapers"…
robert
1 vote · 1 answer

Change the number of running spiders in scrapyd

Hey, so I have about 50 spiders in my project and I'm currently running them via the scrapyd server. I'm running into an issue where some of the resources I use get locked and make my spiders fail or run really slowly. I was hoping there was some way to…
rocktheartsm4l
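The size of scrapyd's process pool is set in scrapyd.conf; a sketch with illustrative values:

```ini
[scrapyd]
# hard cap on concurrent spider processes (0 = no fixed cap)
max_proc = 4
# used only when max_proc is 0: limit is CPUs * max_proc_per_cpu
max_proc_per_cpu = 2
```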
1 vote · 2 answers

Launching Scrapyd with multiple configurations

I'm trying to develop my Scrapy application using multiple configurations depending on my environment (e.g. development, production). My problem is that there are some settings that I'm not sure how to set. For example, if I have to set up my…
ivangoblin
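One common pattern is a single settings.py that applies per-environment overrides chosen by an environment variable; a sketch (the `SCRAPY_ENV` name and the override values are assumptions, not a Scrapy convention):

```python
import os

# settings.py -- shared defaults first
BOT_NAME = "myproject"
DOWNLOAD_DELAY = 0.5

_ENV_OVERRIDES = {
    "development": {"HTTPCACHE_ENABLED": True, "LOG_LEVEL": "DEBUG"},
    "production": {"HTTPCACHE_ENABLED": False, "LOG_LEVEL": "INFO"},
}

def env_overrides(env, overrides=_ENV_OVERRIDES):
    """Pick the override dict for the current environment (default: development)."""
    return overrides.get(env, overrides["development"])

# apply the chosen overrides as ordinary module-level settings
globals().update(env_overrides(os.environ.get("SCRAPY_ENV", "development")))
```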
1 vote · 2 answers

Scrapy - Load a yaml file with a relative path inside the spider

I'm trying to deploy my scrapy crawlers, but the problem is that I have a yaml file that I'm trying to load from inside the spider, this works when the spider is loaded from the shell: scrapy crawl . But when the spider is deployed…
Hakim
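Under scrapyd the project is imported from an egg, so `open()` with a path relative to the working directory (and sometimes even `__file__`) no longer resolves; reading the file through the package loader works in both cases. A sketch (the package and file names in the comment are placeholders):

```python
import pkgutil

def load_bundled_resource(package, filename):
    """Return the text of a data file shipped inside the (possibly zipped) project egg."""
    raw = pkgutil.get_data(package, filename)
    if raw is None:
        raise FileNotFoundError(f"{filename!r} not found in package {package!r}")
    return raw.decode("utf-8")

# inside a spider, e.g.:
#   import yaml
#   config = yaml.safe_load(load_bundled_resource("myproject", "config.yaml"))
```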
1 vote · 1 answer

Schedule spider with SCRAPYD

I'm trying to schedule a spider run. I ran: curl http://localhost:6800/schedule.json -d project=elettronica -d spider=Prokoo It returned: {"status": "error", "message": "'elettronica'"} In scrapyd.log I see: 2014-04-16 17:55:16+0200…
1 vote · 1 answer

How to optimize Scrapyd settings for 200+ spiders

My scrapyd handles 200 spiders at once daily. Yesterday, the server crashed because RAM hit its cap. I am using the default scrapyd settings: [scrapyd] http_port = 6800 debug = off #max_proc = 1 eggs_dir = /var/lib/scrapyd/eggs dbs_dir =…
Michael Nguyen
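The scrapyd.conf knobs that matter most here are the process cap and the job-history sizes; a sketch of a tighter configuration (the values are illustrative, not recommendations):

```ini
[scrapyd]
http_port = 6800
# cap concurrent spider processes instead of letting the pool grow with jobs
max_proc = 8
# used only when max_proc = 0: limit is CPUs * max_proc_per_cpu
max_proc_per_cpu = 2
# how often the queue is polled, in seconds
poll_interval = 5.0
# finished logs/items kept per spider, and finished job records kept in the UI/API
jobs_to_keep = 5
finished_to_keep = 100
```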
1 vote · 3 answers

Keep scrapyd running

I have scrapy and scrapyd installed on a Debian machine. I log in to this server using an SSH tunnel. I then start scrapyd by running: scrapyd Scrapyd starts up fine, and I then open up another SSH tunnel to the server and schedule my spider…
user1009453
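Scrapyd exits with the SSH session because it runs attached to that terminal; anything from `nohup scrapyd &` to a supervisor fixes it. A minimal systemd unit sketch (the path and user are placeholders):

```ini
# /etc/systemd/system/scrapyd.service
[Unit]
Description=Scrapyd daemon
After=network.target

[Service]
User=scrapy
ExecStart=/usr/local/bin/scrapyd
Restart=on-failure

[Install]
WantedBy=multi-user.target
```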
1 vote · 1 answer

How can I automate my spider runs using scrapyd?

I know this probably seems ridiculous. I have given up on a Windows scrapyd implementation and have set up an Ubuntu machine and got everything working just great. I have 3 projects, each with their own spider. I can run my spiders from the terminal…
Mark
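Because scrapyd schedules over plain HTTP, automation usually reduces to cron plus curl; a crontab sketch (the project and spider names are placeholders):

```
# crontab -e  -- run the spider every day at 03:00
0 3 * * * curl -s http://localhost:6800/schedule.json -d project=myproject -d spider=myspider
```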
1 vote · 2 answers

How to start the scrapyd server on an EC2 instance

I have set up an instance on AWS. Now I want to start scrapyd on a particular port. According to the documentation: aptitude install scrapyd-X.YY But aptitude is not found. I have tried installing aptitude using yum, but there is no match found (may…
Tasawer Nawaz
  • 927
  • 8
  • 19
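Amazon Linux AMIs ship yum rather than apt/aptitude, and scrapyd is on PyPI anyway, so pip is the simplest route; a sketch (package names vary by AMI, and the port can be changed with `http_port` in scrapyd.conf):

```
# on an Amazon Linux instance (Ubuntu AMIs can use apt-get instead)
sudo yum install -y python-pip
sudo pip install scrapyd
scrapyd          # serves the JSON API and web UI on port 6800 by default
```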