Questions tagged [scrapy-pipeline]

218 questions
1
vote
1 answer

Scrapy Pipeline to update mysql for each start_url

I have a spider that reads the start_urls from a MySQL database and scrapes an unknown number of links from each page. I want to use pipelines.py to update the database with the scraped links, but I don't know how to get the start_url back into the…
SDailey
  • 17
  • 3
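A common way to keep the originating start_url attached to each scraped link is to carry it in `request.meta` and copy it into the item, so the pipeline sees both. The sketch below shows only the pipeline half with the grouping logic testable in memory; the MySQL UPDATE and the spider side are indicated in comments, and all names (`start_url`, `link`, the SQL) are illustrative assumptions, not the asker's actual schema.

```python
# Spider side (sketch): tag each request with the start_url it came from.
#
#   def start_requests(self):
#       for url in self.urls_from_db:
#           yield scrapy.Request(url, meta={"start_url": url})
#
#   def parse(self, response):
#       for href in response.css("a::attr(href)").getall():
#           yield {"start_url": response.meta["start_url"], "link": href}

class MySQLLinkPipeline:
    """Collects scraped links keyed by the start_url they were found on.

    Real code would execute an UPDATE against MySQL in process_item; here
    the links are kept in a dict so the grouping logic is testable without
    a database connection.
    """

    def open_spider(self, spider):
        self.links_by_start_url = {}
        # self.conn = MySQLdb.connect(...)  # real code would connect here

    def process_item(self, item, spider):
        self.links_by_start_url.setdefault(item["start_url"], []).append(item["link"])
        # self.cursor.execute("UPDATE pages SET links = %s WHERE url = %s", ...)
        return item
```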
1
vote
4 answers

scrapy csvpipeline to export csv according to spiders name or id

I have two different spiders running. I want to write two different CSV files, each named after the spider: spider1.csv for data from spider1 and spider2.csv for data from spider2. Here's my CsvPipeline class: class CsvPipeline(object): def…
CodeNinja101
  • 1,091
  • 4
  • 11
  • 19
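Since every pipeline method receives the running spider, the file name can simply be derived from `spider.name`. A minimal sketch (item field names are illustrative):

```python
import csv

class CsvPipeline:
    """Writes each spider's items to '<spider.name>.csv', so spider1 and
    spider2 each get their own output file from the same pipeline class."""

    def open_spider(self, spider):
        self.file = open(f"{spider.name}.csv", "w", newline="")
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        self.writer.writerow(item.values())
        return item

    def close_spider(self, spider):
        self.file.close()
```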
1
vote
1 answer

How to stop concurrency, or how to issue requests one by one in Scrapy?

I am trying to crawl product data in this order: 1) ADD CART 2) VIEW CART 3) REMOVE CART. For a single-color product it works perfectly, but for a multi-color product Scrapy issues the requests concurrently, so the above process is not in order for each and every…
Vimal Annamalai
  • 139
  • 1
  • 2
  • 12
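Scrapy is asynchronous rather than multithreaded, but the effect is the same: requests are not processed strictly in order. One blunt fix is to limit concurrency in the project settings; a stricter fix is to chain requests, yielding step 2 only from the callback of step 1. A sketch of the settings approach (values are illustrative):

```python
# settings.py (sketch) — process one request at a time
CONCURRENT_REQUESTS = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Alternatively, chain the steps so each callback yields the next request:
#   def parse_add_cart(self, response):
#       yield scrapy.Request(view_cart_url, callback=self.parse_view_cart)
#   def parse_view_cart(self, response):
#       yield scrapy.Request(remove_cart_url, callback=self.parse_remove_cart)
```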
1
vote
1 answer

Scrapy Connect Different Items for Yield

I scrape a news site. Each news article has content and many comments. I have two Items: one for the content, and another for the multiple comments. The problem is that the content and the comments are yielded from different requests. I want a news article's content and its multiple…
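The usual way to combine data from two requests into one item is to pass the partially-filled item through `request.meta` and complete it in the second callback. The sketch below simulates the two callbacks as plain functions so the hand-off logic is testable; all names are illustrative, and the real Scrapy wiring is shown in comments.

```python
def parse_article(content, comments_url):
    """First callback: start a combined item and hand it to the follow-up
    request. In Scrapy this would be:
        yield scrapy.Request(comments_url, callback=self.parse_comments,
                             meta={"item": item})
    """
    item = {"content": content, "comments": []}
    return item, comments_url

def parse_comments(item, scraped_comments):
    """Second callback: pull the item back out of response.meta and
    complete it, then yield it once with content and all comments."""
    item["comments"].extend(scraped_comments)
    return item
```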
1
vote
1 answer

Scrapy: How to clean the response?

Here is my code snippet. I am trying to scrape a website using Scrapy and then store the data in Elasticsearch for indexing. def parse(self, response): for news in response.xpath('head'): yield { 'pagetype':…
Slyper
  • 896
  • 2
  • 15
  • 32
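Scraped text usually needs its whitespace normalized before being indexed. A minimal cleaning helper that could be applied to each field before yielding the item (a sketch, not the asker's code):

```python
import re

def clean(text):
    """Collapse runs of whitespace (newlines, tabs, multiple spaces) into
    single spaces and strip the ends; returns '' for missing values."""
    if text is None:
        return ""
    return re.sub(r"\s+", " ", text).strip()
```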
1
vote
0 answers

Reading existing Django models from inside Scrapy spider

I am working on a project where urls are put into a Django model called UrlItems. The models.py file containing UrlItems is located in the home app. I typed scrapy startproject scraper in the same directory as the models.py file. Please see this…
kas
  • 857
  • 1
  • 15
  • 21
1
vote
0 answers

Flask-SQLAlchemy: checking for duplicates in the database

Hi, I am using the Python Scrapy library to create spiders and extract data from websites. In my pipeline I use Flask-SQLAlchemy to configure the spider so that it adds the scraped data to a SQLite table. I am trying to figure out how to prevent the…
A. Sharma
  • 33
  • 6
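The standard pattern is to query for the unique field before inserting and drop the item if it already exists. The sketch below uses the stdlib `sqlite3` module so it runs standalone; the asker's Flask-SQLAlchemy version would use `Model.query.filter_by(url=...).first()` for the existence check instead. Table and field names are illustrative.

```python
import sqlite3

class DedupPipeline:
    """Drops items whose 'url' is already in the table; otherwise inserts.

    Real Scrapy code would `raise DropItem(...)` instead of returning None.
    """

    def open_spider(self, spider):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE IF NOT EXISTS items (url TEXT PRIMARY KEY)")

    def process_item(self, item, spider):
        exists = self.conn.execute(
            "SELECT 1 FROM items WHERE url = ?", (item["url"],)).fetchone()
        if exists:
            return None  # duplicate: drop it
        self.conn.execute("INSERT INTO items (url) VALUES (?)", (item["url"],))
        return item
```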
1
vote
0 answers

Make scrapy pipeline wait on another item in the same or previous pipeline

My problem is as follows: I have three item pipelines: a FilesPipeline that downloads archives, an ArchiveUnpackerPipeline that unpacks each archive, and a SymbolicLinkerPipeline that generates symbolic links to the contents of those archives. The issue is…
ZeeD26
  • 11
  • 3
1
vote
3 answers

How to enable overwriting a file every time in Scrapy item export?

I am scraping a website which returns a list of URLs. Example: scrapy crawl xyz_spider -o urls.csv. It works absolutely fine, but I want each run to create a fresh urls.csv rather than append data to the file. Is there any parameter I can pass to make it…
Nikhil Parmar
  • 876
  • 2
  • 11
  • 27
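With `-o`, Scrapy appends to an existing file; newer Scrapy versions (2.x) accept `-O urls.csv` to overwrite instead. Where upgrading isn't an option, a pipeline that opens the file in `"w"` mode at spider start gives the same effect. A sketch (the filename and the `url` field are illustrative):

```python
import csv

class OverwritingCsvPipeline:
    """Opens urls.csv in 'w' mode when the spider starts, so every run
    replaces the file instead of appending to it."""

    def open_spider(self, spider):
        self.file = open("urls.csv", "w", newline="")
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        self.writer.writerow([item["url"]])
        return item

    def close_spider(self, spider):
        self.file.close()
```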
1
vote
0 answers

Scrapy: How to send data to the pipeline from a custom filter without downloading

To catch all redirection paths, including when the final url was already crawled, I wrote a custom duplicate filter: import logging from scrapy.dupefilters import RFPDupeFilter from seoscraper.items import RedirectionItem class…
Antoine Brunel
  • 1,065
  • 2
  • 14
  • 30
1
vote
1 answer

Scrapy: Changing media pipeline download priorities: How to delay media files downloads at the very end of the crawl?

http://doc.scrapy.org/en/latest/topics/media-pipeline.html When the item reaches the FilesPipeline, the URLs in the file_urls field are scheduled for download using the standard Scrapy scheduler and downloader (which means the scheduler and…
Antoine Brunel
  • 1,065
  • 2
  • 14
  • 30
1
vote
0 answers

Scrapinghub MySQL Pipeline

I'm trying to create a Scrapy pipeline that exports the scraped data to a MySQL database. I've written my script (pipeline.py): from datetime import datetime from hashlib import md5 from scrapy import log from scrapy.exceptions import DropItem from…
NickT
  • 31
  • 1
1
vote
1 answer

Feed Rethinkdb with scrapy

I'm looking for a simple tutorial explaining how to write items to Rethinkdb from scrapy. The equivalent can be found for MongoDB here.
1
vote
1 answer

scrapyd multiple spiders writing items to same file

I have a scrapyd server with several spiders running at the same time; I start the spiders one by one using the schedule.json endpoint. All spiders write their contents to a common file using a pipeline: class JsonWriterPipeline(object): def __init__(self,…
silvestrelosada
  • 55
  • 1
  • 10
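When several scrapyd-launched spiders share one output file, opening it in `"w"` mode from any of them truncates the others' work; opening in append mode and writing one self-contained JSON line per item keeps the records separable. A sketch under that assumption (the shared filename is illustrative):

```python
import json

class JsonWriterPipeline:
    """Appends one JSON line per item to a shared .jl file. Each spider
    process keeps its own handle; appending whole lines and flushing per
    item keeps records from different spiders intact and parseable."""

    def open_spider(self, spider):
        self.file = open("items.jl", "a")

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + "\n")
        self.file.flush()
        return item

    def close_spider(self, spider):
        self.file.close()
```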
1
vote
1 answer

How do I get a plain URL from Redis rather than cPickle-serialized data?

I use scrapy-redis to build a simple distributed crawler; the slave machines need to read URLs from the master's queue. The problem is that what the slave machine gets from the queue is cPickle-serialized data. I want to get plain URLs from the redis URL queue, which is…
rowele
  • 85
  • 10