Questions tagged [scrapyd]

`Scrapyd` is a daemon for managing `Scrapy` projects. It was originally part of `scrapy` itself, but was split out and is now a standalone project. It runs on a machine and lets you deploy (upload) your projects and control the spiders they contain through a JSON web service.

Scrapyd can manage multiple projects and each project can have multiple versions uploaded, but only the latest one will be used for launching new spiders.
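As a concrete illustration of that JSON web service, the sketch below schedules a spider run and then lists its jobs using scrapyd's documented `schedule.json` and `listjobs.json` endpoints. The host, project, and spider names are placeholders.

```python
import requests

SCRAPYD = "http://localhost:6800"  # scrapyd's default address; adjust to your host

# Schedule a run of a deployed spider ("myproject"/"myspider" are placeholders).
resp = requests.post(
    f"{SCRAPYD}/schedule.json",
    data={"project": "myproject", "spider": "myspider"},
)
print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}

# List pending, running, and finished jobs for the same project.
jobs = requests.get(f"{SCRAPYD}/listjobs.json", params={"project": "myproject"}).json()
print(len(jobs["pending"]), len(jobs["running"]), len(jobs["finished"]))
```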

355 questions
1 vote • 1 answer

in `escape': undefined method `gsub' for # (NoMethodError)

Hi, I am trying to scrape a web page, take the links, go to those links, and scrape them too. require 'rubygems' require 'scrapi' require 'uri' Scraper::Base.parser :html_parser web = "http://......" def sub_web(linksubweb) uri =…
Mike Norton • 71 • 1 • 9
1 vote • 2 answers

Generic spider for a Scrapy project

I am creating a generic spider (a Scrapy spider) for multiple websites. Below is my project directory structure: myproject --- __init__.py --- common.py --- scrapy.cfg --- myproject ---__init__.py ---items.py …
AGR • 225 • 1 • 2 • 16
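For the generic-spider question above, one common pattern (a minimal sketch, not the asker's actual code) is a single spider class that receives its per-site details as spider arguments:

```python
import scrapy

class GenericSpider(scrapy.Spider):
    """One spider class reused across sites; site details arrive as arguments."""
    name = "generic"

    def __init__(self, start_url=None, allowed=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [start_url] if start_url else []
        self.allowed_domains = [allowed] if allowed else []

    def parse(self, response):
        # Placeholder extraction; real per-site rules would be plugged in here.
        yield {"url": response.url, "title": response.css("title::text").get()}
```

Run it as `scrapy crawl generic -a start_url=https://example.com -a allowed=example.com`; scrapyd's `schedule.json` forwards extra POST parameters to the spider as arguments in the same way.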
1 vote • 1 answer

MySQL not saving data that's being scraped

I made a small project using Scrapy. The thing is that my spider is crawling pages and scraping data, but the data is not being saved into my database. I am using MySQL as my database. I guess there is something I am missing in my pipelines.py…
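For the question above, the usual suspect is the item pipeline. A minimal sketch of a MySQL pipeline, assuming the `pymysql` driver and a hypothetical `pages` table (any DB-API driver works the same way):

```python
import pymysql  # assumption: pymysql is installed


class MySQLPipeline:
    def open_spider(self, spider):
        # Hypothetical connection details; replace with your own.
        self.conn = pymysql.connect(
            host="localhost", user="root", password="secret", database="scraped"
        )
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        # Hypothetical table and columns, purely for illustration.
        self.cur.execute(
            "INSERT INTO pages (url, title) VALUES (%s, %s)",
            (item.get("url"), item.get("title")),
        )
        self.conn.commit()  # a missing commit is a classic "crawls fine, saves nothing" bug
        return item

    def close_spider(self, spider):
        self.conn.close()
```

The pipeline also has to be registered in settings.py, e.g. `ITEM_PIPELINES = {"myproject.pipelines.MySQLPipeline": 300}` (path hypothetical); an unregistered pipeline silently never runs.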
1 vote • 0 answers

Scraped items not being saved into the database

My Scrapy project is not saving data into the database. It is scraping the data, but not adding it to the database. Please look into the code and suggest something. My spider.py file: from scrapy.spider import BaseSpider from…
Abhimanyu • 81 • 6
1 vote • 1 answer

Scrapy deploy stopped working

I am trying to deploy a Scrapy project using scrapyd, but it is giving me an error: sudo scrapy deploy default -p eScraper Building egg of eScraper-1371463750 'build/scripts-2.7' does not exist -- can't clean it zip_safe flag not set; analyzing archive…
Vaibhav Jain • 5,287 • 10 • 54 • 114
1 vote • 0 answers

Scrapy: having problems crawling a .aspx page

I'm trying to crawl a .aspx page, but it redirects me to a page which doesn't exist. To solve this, I tried to set 'dont_merge_cookies': True and 'dont_redirect': True and to override my start_requests, but now it gives me an error: "'Response'…
user_2000 • 1,103 • 3 • 14 • 26
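Regarding the .aspx question above: `dont_redirect` and `dont_merge_cookies` are real `Request.meta` keys, but with redirects disabled the 3xx response itself reaches the callback, so it must also be whitelisted via `handle_httpstatus_list`. A sketch (URL and spider name are placeholders):

```python
import scrapy

class AspxSpider(scrapy.Spider):
    name = "aspx_example"                            # hypothetical name
    start_urls = ["https://example.com/page.aspx"]   # placeholder URL

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={
                    "dont_redirect": True,            # keep the 302 response
                    "handle_httpstatus_list": [302],  # let it reach the callback
                    "dont_merge_cookies": True,       # bypass the cookie jar
                },
                callback=self.parse,
            )

    def parse(self, response):
        # With dont_redirect set, parse() receives the redirect response itself.
        yield {"status": response.status, "location": response.headers.get("Location")}
```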
1 vote • 0 answers

Running scrapy commands using os.system or subprocess.call

I have a Scrapy project with a web-based interface running on Apache (XAMPP) that allows the user to create, modify and schedule spiders and also includes a call to scrapyd at port 6800 to get the pending/running/finished spiders. It all works…
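For the question above, a sketch of the `subprocess` side: passing the command as an argument list avoids the shell-quoting pitfalls of `os.system`, and the working directory must be the Scrapy project root (paths and names are placeholders):

```python
import subprocess

# Run a spider as a child process. Scrapy writes its log to stderr.
result = subprocess.run(
    ["scrapy", "crawl", "myspider"],  # "myspider" is a placeholder
    cwd="/path/to/scrapy/project",    # must be inside the Scrapy project
    capture_output=True,
    text=True,
)
print(result.returncode)
print(result.stderr[-500:])  # tail of the crawl log
```

When the crawls are already managed by scrapyd, as in this setup, polling `listjobs.json` as shown earlier is usually simpler than spawning processes from the web app.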
1 vote • 2 answers

libxml2 or lxml error when trying to run the command "scrapy crawl test"

I have the following source code: # Spider class test_crawler(BaseSpider): name = 'test' allowed_domains = ['http://test.com'] start_urls = ['http://test.com/test'] def parse(self, response): hxs =…
Thinh Phan • 655 • 1 • 14 • 27
0 votes • 0 answers

Beginner question regarding Scrapy and scrapy crawl

I recently began learning web scraping with Scrapy and tried running scrapy crawl against books.toscrape.com. According to the terminal, the scrapy crawl call works fine, but it doesn't return the item count nor does it show any of the…
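A frequent cause of the "zero items" symptom described above is a `parse` method that prints instead of yielding; Scrapy's item counter only counts yielded items. A minimal books.toscrape.com sketch (selectors taken from the public tutorial site):

```python
import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"  # hypothetical spider name
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        # Each yielded dict increments the "item_scraped_count" stat;
        # printing instead of yielding leaves the counter at zero.
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css("p.price_color::text").get(),
            }
```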
0 votes • 0 answers

Scrapyd launch failure on an imported DigitalOcean droplet

A server image has been exported and imported into a new DigitalOcean account. I've created a droplet from it, with authentication through SSH keys rather than password authentication (see note). Now it's running, yet when I launch scrapyd in the console, the following…
Igor Savinkin • 5,669 • 8 • 37 • 69
0 votes • 0 answers

How to resume scrapy crawler on startup through scrapyd?

I am trying to run the Scrapy crawler through scrapyd with JOBDIR. I have a script in which I am sending a POST request to the scrapyd server: scrapyd_script: import requests import json import logging from datetime import…
X-somtheing • 219 • 2 • 10
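For the resume question above: scrapyd's `schedule.json` accepts a `setting` parameter, so `JOBDIR` can be passed per job; reusing the same directory on the next run is what lets Scrapy resume a paused crawl. A sketch with placeholder names and paths:

```python
import requests

# Schedule a crawl through scrapyd, passing JOBDIR so Scrapy persists its
# request queue on disk and can pick up where it left off after a restart.
resp = requests.post(
    "http://localhost:6800/schedule.json",
    data={
        "project": "myproject",                        # placeholder
        "spider": "myspider",                          # placeholder
        "setting": "JOBDIR=/var/lib/scrapy/jobs/run1", # same path each run to resume
    },
)
print(resp.json())
```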
0 votes • 0 answers

Deploying a spider with a git repo dependency fails eggification

I have a Scrapy project (hereafter called the_application) that has a dependency on a library (hereafter called the_library) fetched from a git repository, and every time I attempt to deploy the Scrapy project by running scrapyd-deploy…
Hrafn • 2,867 • 3 • 25 • 44
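Context for the eggification question above: `scrapyd-deploy` builds the egg from an auto-generated setup.py along the lines of the sketch below, and that egg does not bundle git (or any other) dependencies; those have to be installed on the scrapyd host separately (e.g. `pip install git+https://...`). The name `the_application` is taken from the question.

```python
# Rough shape of the setup.py that scrapyd-deploy generates for the egg build.
from setuptools import setup, find_packages

setup(
    name="the_application",
    version="1.0",
    packages=find_packages(),
    # The "scrapy" entry point tells scrapyd which settings module to load.
    entry_points={"scrapy": ["settings = the_application.settings"]},
)
```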
0 votes • 0 answers

Docker image runs fine on local machine, but fails with "/usr/local/bin/scrapyd -n: Unknown command: scrapyd" when deployed on heroku

This is my Dockerfile: FROM python:3.10 WORKDIR /usr/src/app COPY requirements.txt ./ RUN pip install --no-cache-dir -r requirements.txt COPY CollegeXUniversityDataScraper ./CollegeXUniversityDataScraper/ COPY scrapyd.conf ./ ENTRYPOINT […
0 votes • 0 answers

Scrapyd deploy failing on Python 3.8

Stats: I start Scrapyd in env: (env) sh-3.2$ scrapyd 2023-01-18T14:44:21+0400 [-] Loading /Users/parikshit.mukherjee/PycharmProjects/nn/ufc-data-crawler/env/lib/python3.8/site-packages/scrapyd/txapp.py... 2023-01-18T14:44:21+0400 [-] Basic…
0 votes • 1 answer

How to correctly configure CONCURRENT_REQUESTS in a project with multiple spiders

I have a Scrapy project with ~10 spiders, and I run a few of them simultaneously using Scrapyd. However, I have doubts about whether my CONCURRENT_REQUESTS setting is correct. Currently my CONCURRENT_REQUESTS is 32, but I have seen it recommended that…
Jalil SA • 26 • 3
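One point worth adding to that last question: scrapyd starts each spider in its own process, so `CONCURRENT_REQUESTS` caps each running spider separately rather than the project as a whole. A sketch of the related settings (values are illustrative, not tuning advice):

```python
# settings.py -- illustrative values, not recommendations
CONCURRENT_REQUESTS = 32             # cap across all domains, per crawler process
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # cap per target domain
DOWNLOAD_DELAY = 0.25                # seconds between requests to the same domain
```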