Questions tagged [scrapyd]

`Scrapyd` is a daemon for managing `Scrapy` projects. It used to be part of `scrapy` itself, but was split out and is now a standalone project. It runs on a machine and lets you deploy (i.e., upload) your projects and control the spiders they contain through a JSON web service.

Scrapyd can manage multiple projects and each project can have multiple versions uploaded, but only the latest one will be used for launching new spiders.
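
As an illustration, a minimal interaction with that web service, assuming a default instance on localhost:6800 and illustrative project/spider names, might look like this:

```python
import requests

SCRAPYD = "http://localhost:6800"

# Schedule a crawl; Scrapyd responds with a job id on success.
job = requests.post(f"{SCRAPYD}/schedule.json",
                    data={"project": "myproject", "spider": "myspider"}).json()
print(job)  # e.g. {"status": "ok", "jobid": "..."}

# List pending, running, and finished jobs for the project.
jobs = requests.get(f"{SCRAPYD}/listjobs.json",
                    params={"project": "myproject"}).json()
print(jobs["running"])
```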

355 questions
2 votes • 0 answers

Python: how to debug high CPU usage in Scrapy

I'm trying to debug my CPU usage. I have already tried several things: adding sleep(0.1) to pipelines, disabling pipelines, and using Scrapyd with job persistence (the JOBDIR parameter, to save data to disk instead of keeping it in memory), but I guess this only…
Erik van de Ven • 4,747 • 6 • 38 • 80
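
On the job-persistence angle above, a hedged sketch of passing JOBDIR when scheduling through Scrapyd (project, spider, and path are illustrative): JOBDIR makes Scrapy keep its request queue and duplicate filter on disk rather than in memory.

```python
import requests

# schedule.json accepts "setting" parameters as Scrapy settings overrides.
requests.post("http://localhost:6800/schedule.json", data={
    "project": "myproject",
    "spider": "myspider",
    "setting": "JOBDIR=crawls/myspider-1",  # one directory per job
})
```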
2 votes • 2 answers

Rename output file after a Scrapy spider completes

I am using Scrapy and Scrapyd to monitor certain sites. The output files are compressed jsonlines. Right after I submit a job schedule to scrapyd, I can see the output file being created, and it grows as the spider scrapes. My problem is that I can't be sure…
Andy • 1,231 • 1 • 15 • 27
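
One way to know when the file is complete is to poll listjobs.json for the job id returned by schedule.json and rename only once the job reaches the finished list; a sketch, with names and paths illustrative:

```python
import os
import time

import requests

SCRAPYD = "http://localhost:6800"

def rename_when_finished(project, jobid, src, dst, poll_seconds=10):
    """Wait until Scrapyd reports the job finished, then rename its output."""
    while True:
        jobs = requests.get(f"{SCRAPYD}/listjobs.json",
                            params={"project": project}).json()
        if any(job["id"] == jobid for job in jobs.get("finished", [])):
            os.rename(src, dst)
            return
        time.sleep(poll_seconds)
```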
2 votes • 1 answer

Implementing a custom Scrapyd service

I want to create my own service for the scrapyd API, which should return a little more information about running crawlers. I got stuck at the very beginning: where should I place the module that will contain that service? If we look at the default…
ilov3 • 427 • 2 • 7
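
For orientation: Scrapyd's stock endpoints are WsResource subclasses in scrapyd.webservice, wired up in the [services] section of its config, so a hedged sketch of a custom endpoint (module path and endpoint name are illustrative, and Scrapyd's internals vary between versions) could be:

```python
# myproject/webservice.py  (illustrative location; any importable module works)
from scrapyd.webservice import WsResource

class RunningDetails(WsResource):
    """Extra JSON endpoint reporting details about running crawls."""

    def render_GET(self, txrequest):
        # self.root is the Scrapyd application root; its launcher tracks
        # the currently running Scrapy processes.
        running = [{"project": p.project, "spider": p.spider, "job": p.job}
                   for p in self.root.launcher.processes.values()]
        return {"status": "ok", "running": running}

# Then map a URL to it in scrapyd.conf (re-listing the default services
# may be necessary, since the section is replaced rather than merged):
#   [services]
#   runningdetails.json = myproject.webservice.RunningDetails
```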
2 votes • 1 answer

Storing Scrapyd schedule details in a database

Hi, I am using Scrapyd to schedule my spiders. The problem is that I want to keep track of all the historic information about the jobs scheduled so far, but if the scrapyd server restarts, all the information is deleted. My question is: is there…
backtrack • 7,996 • 5 • 52 • 99
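
Since listjobs.json only reflects Scrapyd's in-memory state, one hedged approach is to snapshot it periodically into a database that survives restarts (project name and schema are illustrative):

```python
import sqlite3

import requests

conn = sqlite3.connect("job_history.db")
conn.execute("""CREATE TABLE IF NOT EXISTS jobs
                (id TEXT PRIMARY KEY, spider TEXT, start TEXT, end TEXT)""")

# Run this periodically (e.g. from cron); finished jobs accumulate in
# the table even after the Scrapyd server restarts.
jobs = requests.get("http://localhost:6800/listjobs.json",
                    params={"project": "myproject"}).json()
for job in jobs.get("finished", []):
    conn.execute("INSERT OR IGNORE INTO jobs VALUES (?, ?, ?, ?)",
                 (job["id"], job["spider"],
                  job.get("start_time"), job.get("end_time")))
conn.commit()
```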
2 votes • 0 answers

Can't connect to the Scrapyd API

I'm trying to use the scrapyd service to schedule a spider. I'm on Mac OS 10.9.5. I start the service by running 'scrapyd'; it runs fine and I can navigate to the web interface at http://localhost:6800. But when I try to use the API, for example:…
user1009453 • 707 • 2 • 11 • 28
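
The excerpt cuts off before the actual call, but a common pitfall here is the HTTP method: schedule.json only accepts POST, while the listing endpoints are plain GET. A minimal sanity check (project and spider names illustrative):

```python
import requests

# GET endpoint: quick connectivity check.
print(requests.get("http://localhost:6800/listprojects.json").json())

# schedule.json must be POSTed; issuing a GET returns an error response.
print(requests.post("http://localhost:6800/schedule.json",
                    data={"project": "myproject", "spider": "myspider"}).json())
```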
2 votes • 0 answers

Scrapyd job didn't finish

I used scrapyd to run scrapy jobs, and 127.0.0.1:6800 shows the job as finished. But when I open the log, there is no error message and no finish stats like this: {'downloader/request_bytes': 1685, 'downloader/request_count': 4, …
user2492364 • 6,543 • 22 • 77 • 147
2 votes • 1 answer

Scrapyd error when scheduling a new spider

I cannot schedule a spider run. The deploy seems to be OK: Deploying to project "scraper" in http://localhost:6800/addversion.json Server response (200): {"status": "ok", "project": "scraper", "version": "1418909664", "spiders": 3} Scheduling a new…
sergiuz • 5,353 • 1 • 35 • 51
2 votes • 1 answer

Unable to deploy a Portia spider with scrapyd-deploy

Could you please help me figure out what I'm doing wrong? Here are the steps: followed the Portia install manual found at https://github.com/scrapinghub/portia (all OK); created a new project, entered a URL, tagged an item (all OK); clicked…
Mihai • 133 • 1 • 14
2 votes • 1 answer

Sharing middleware and pipeline code between Scrapyd projects

I have several scrapy projects that I have deployed to a scrapyd instance. They all tend to use the same middleware code, which I created and then duplicated amongst the projects. I would like to avoid this duplication. Is there a…
trajan • 1,093 • 2 • 12 • 15
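
Because scrapyd-deploy builds the uploaded egg from each project's setup.py, one hedged approach (package names illustrative) is to move the shared middleware into its own package and include it in every project's egg:

```python
# setup.py -- the file scrapyd-deploy generates, lightly edited
from setuptools import setup, find_packages

setup(
    name="project1",
    version="1.0",
    # find_packages() picks up both the project package and a
    # shared_middleware package living alongside it.
    packages=find_packages(),
    entry_points={"scrapy": ["settings = project1.settings"]},
)
```

Alternatively, installing the shared package system-wide on the Scrapyd host works too, since the spiders can then import it like any other installed module.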
2 votes • 1 answer

pymongo.errors.ConnectionFailure: timed out on an Ubuntu EC2 instance running scrapyd

So... I'm running scrapyd on my Ubuntu EC2 instance after following this post: http://www.dataisbeautiful.io/deploying-scrapy-ec2/. However, I can't get pymongo to connect to my MongoLabs mongo database, since the Ubuntu EC2 scrapyd logs are…
pyramidface • 1,207 • 2 • 17 • 39
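
To separate a networking or security-group problem from a Scrapyd one, it can help to test the connection directly on the instance with a short timeout; a sketch assuming pymongo 3+ and an illustrative connection URI:

```python
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure

# Short server-selection timeout so a blocked outbound port fails fast
# instead of hanging.
client = MongoClient("mongodb://user:password@example.mongolab.com:27017/mydb",
                     serverSelectionTimeoutMS=5000)
try:
    client.admin.command("ping")
    print("connected")
except ConnectionFailure as exc:
    print("cannot reach MongoDB:", exc)
```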
2 votes • 1 answer

Heavy CPU usage by a Scrapy crawler

I have multiple spiders running in parallel in multiple instances (4). All of them use almost 100% CPU. I've deployed them using scrapyd and tried changing scrapyd settings like…
Sravan • 61 • 6
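
On the Scrapyd side, the number of crawl processes run in parallel is controlled in scrapyd.conf; a hedged sketch with illustrative values:

```ini
[scrapyd]
# Hard cap on concurrent Scrapy processes (0 means derive from CPU count).
max_proc = 2
# Used when max_proc is 0: how many processes to allow per CPU.
max_proc_per_cpu = 1
```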
2 votes • 1 answer

Scrapyd Error: exceptions.AttributeError: 'dict' object has no attribute 'fields'

I recently published a working scraper to scrapyd. I'm getting the error message below when I run the scrape. I reviewed this closed issue, https://github.com/scrapy/scrapy/issues/86, and implemented the recommended fix per the docs:…
dfriestedt • 483 • 1 • 3 • 18
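
That traceback usually means a component expected a scrapy.Item, which carries a .fields mapping, but received a plain dict (older Scrapy versions did not accept dicts as items at all). A minimal sketch of yielding a declared Item instead, with illustrative names and selectors:

```python
import scrapy

class ProductItem(scrapy.Item):
    # Declared fields are what give the item its .fields attribute.
    name = scrapy.Field()
    price = scrapy.Field()

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["http://example.com/"]

    def parse(self, response):
        yield ProductItem(
            name=response.css("h1::text").get(),
            price=response.css(".price::text").get(),
        )
```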
2 votes • 0 answers

Egg file needs permission in scrapyd while deploying

If I do this for another project, it shows: $ scrapy deploy scrapyd Packing version 1412325181 Deploying to project "project2" in http://localhost:6800/addversion.json Server response (200): {"status": "error", "message": "[Errno 13] Permission…
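
An Errno 13 from addversion.json typically means the user running the Scrapyd daemon cannot write to its eggs directory; one hedged fix (path illustrative) is to point eggs_dir at a directory that user owns in scrapyd.conf:

```ini
[scrapyd]
# Must be writable by the account the scrapyd daemon runs as.
eggs_dir = /home/deploy/scrapyd/eggs
```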
2 votes • 1 answer

Schedule a spider in scrapyd and pass spider config options

I'm trying to configure spiders created with slyd to use scrapy-elasticsearch, so I'm sending -d parameter=value to configure it: curl http://localhost:6800/schedule.json -d project=myproject -d spider=myspider -d setting=CLOSESPIDER_ITEMCOUNT=100…
localhost • 55 • 1 • 6
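
For reference, schedule.json treats each setting=KEY=VALUE pair as a Scrapy settings override, while any other parameter becomes a spider argument; the equivalent of the curl call in Python, with an illustrative extra spider argument:

```python
import requests

# A list of pairs allows repeating the "setting" key if needed.
requests.post("http://localhost:6800/schedule.json", data=[
    ("project", "myproject"),
    ("spider", "myspider"),
    ("setting", "CLOSESPIDER_ITEMCOUNT=100"),  # settings override
    ("category", "electronics"),               # arrives as a spider kwarg
])
```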
2 votes • 4 answers

Scrapyd with Polipo and Tor

UPDATE: I am now running this command: scrapyd-deploy, and getting this error: 504 Connect to localhost:8123 failed: General SOCKS server failure. I am trying to deploy my scrapy spider through scrapyd-deploy; the following is the…
Moataz Elmasry • 542 • 1 • 5 • 18
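
A plausible reading of that 504 is that the deploy request to the local Scrapyd is itself being routed through Polipo (port 8123) and on into Tor, which cannot reach localhost. Assuming scrapyd-deploy honors the conventional proxy environment variables (worth verifying for your version), a sketch of exempting local addresses:

```python
import os
import subprocess

# Keep the proxy configured for the crawl itself, but exempt local
# addresses so the deploy can reach Scrapyd on localhost:6800 directly.
env = dict(os.environ, no_proxy="localhost,127.0.0.1")
subprocess.run(["scrapyd-deploy"], env=env, check=True)
```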