Questions tagged [scrapy-pipeline]
218 questions
1
vote
1 answer
Scrapy pipeline to update MySQL for each start_url
I have a spider that reads its start_urls from a MySQL database and scrapes an unknown number of links from each page. I want to use pipelines.py to update the database with the scraped links, but I don't know how to get the start_url back into the…

SDailey
- 17
- 3
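A common pattern here: carry the originating start_url through response.meta into each item, then key the database write on it in the pipeline. A minimal sketch, assuming pymysql, a links(start_url, link) table, and hypothetical item fields start_url and link:

import pymysql

class MySQLLinkPipeline(object):
    def open_spider(self, spider):
        # placeholder credentials
        self.conn = pymysql.connect(host='localhost', user='user',
                                    password='secret', db='scraping')

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        # item['start_url'] is set in the spider, e.g.
        # scrapy.Request(link, meta={'start_url': response.url})
        with self.conn.cursor() as cur:
            cur.execute("INSERT INTO links (start_url, link) VALUES (%s, %s)",
                        (item['start_url'], item['link']))
        self.conn.commit()
        return item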
1
vote
4 answers
Scrapy CSV pipeline to export CSV according to spider name or id
I have two different spiders running, and I want to write two different CSV files named after each spider:
spider1.csv for data from spider1 and spider2.csv for data from spider2.
Here's my CsvPipeline class:
class CsvPipeline(object):
    def…

CodeNinja101
- 1,091
- 4
- 11
- 19
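One way to get a file per spider is to key the filename on spider.name in open_spider. A minimal sketch (deriving the CSV header from the first item's fields is an assumption):

import csv

class CsvPipeline(object):
    def open_spider(self, spider):
        # one file per spider: spider1.csv, spider2.csv, ...
        self.file = open('%s.csv' % spider.name, 'w', newline='')
        self.writer = None

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        row = dict(item)
        if self.writer is None:
            # header taken from the first item's fields
            self.writer = csv.DictWriter(self.file, fieldnames=list(row))
            self.writer.writeheader()
        self.writer.writerow(row)
        return item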
1
vote
1 answer
How to stop multithreading, or how to issue requests one by one in Scrapy?
I am trying to crawl product data in this order:
1) ADD CART
2) VIEW CART
3) REMOVE CART
For a single-color product it works perfectly, but for a multi-color product Scrapy issues the requests concurrently, so the above steps are not in order for each and every…

Vimal Annamalai
- 139
- 1
- 2
- 12
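Scrapy is asynchronous rather than multithreaded, but the effect is the same: requests run concurrently. Two levers restore ordering: limit concurrency to one, and chain each step from the previous callback instead of yielding them all at once. A sketch with placeholder cart URLs:

import scrapy

class CartSpider(scrapy.Spider):
    name = 'cart'
    # one request in flight at a time
    custom_settings = {'CONCURRENT_REQUESTS': 1}

    def start_requests(self):
        yield scrapy.Request('https://example.com/cart/add', callback=self.after_add)

    def after_add(self, response):
        yield scrapy.Request('https://example.com/cart/view', callback=self.after_view)

    def after_view(self, response):
        yield scrapy.Request('https://example.com/cart/remove', callback=self.done)

    def done(self, response):
        self.logger.info('cart flow finished: %s', response.status)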
1
vote
1 answer
Scrapy: Combining Different Items Before Yielding
I scrape a news site. Every news article has content and many comments. I have two Items: one for the content and another for the multiple comments.
The problem is that the content and the comments are yielded from different requests. I want a news article's content and its multiple…

Remzi Meric Ceylan
- 75
- 9
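The usual fix is to not yield the content item from the article callback at all; instead, pass the partially built item to the comments request via meta and yield the combined result once. A sketch with hypothetical URLs and selectors:

import scrapy

class NewsSpider(scrapy.Spider):
    name = 'news'
    start_urls = ['https://example.com/news/1']   # placeholder

    def parse(self, response):
        item = {
            'url': response.url,
            'content': response.css('article ::text').extract(),
            'comments': [],
        }
        # hand the partial item to the comments request instead of yielding now
        yield scrapy.Request(response.urljoin('comments'),
                             callback=self.parse_comments,
                             meta={'item': item})

    def parse_comments(self, response):
        item = response.meta['item']
        item['comments'] = response.css('.comment ::text').extract()
        yield item   # one combined item per article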
1
vote
1 answer
Scrapy: How to clean the response?
Here is my code snippet. I am trying to scrape a website using Scrapy and then store the data in Elasticsearch for indexing.
def parse(self, response):
    for news in response.xpath('head'):
        yield {
            'pagetype':…

Slyper
- 896
- 2
- 15
- 32
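For cleaning, a small helper that strips tags and collapses whitespace before the dict is yielded usually suffices; w3lib, which Scrapy already depends on, provides remove_tags. A sketch with a hypothetical XPath:

from w3lib.html import remove_tags

def clean(values):
    # join the extracted strings, drop tags, collapse runs of whitespace
    text = ' '.join(v.strip() for v in values if v and v.strip())
    return ' '.join(remove_tags(text).split())

# inside parse():
#     yield {'pagetype': clean(news.xpath('.//meta[@name="pagetype"]/@content').extract())}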
1
vote
0 answers
Reading existing Django models from inside a Scrapy spider
I am working on a project where URLs are put into a Django model called UrlItems. The models.py file containing UrlItems is located in the home app. I typed scrapy startproject scraper in the same directory as the models.py file. Please see this…

kas
- 857
- 1
- 15
- 21
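The usual recipe is to point the Scrapy project's settings.py at the Django project before importing any model. Paths and module names below are assumptions for this layout:

import os
import sys

import django

# at the top of the Scrapy project's settings.py
sys.path.append('/path/to/django_project')   # directory containing manage.py
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'django_project.settings')
django.setup()

# after this runs, the spider can simply do:
# from home.models import UrlItems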
1
vote
0 answers
Flask-SQLAlchemy: checking for duplicates in the database
Hi, I am using the Python Scrapy library to create spiders and extract data from websites. In my pipeline I use Flask-SQLAlchemy so that the spider adds the scraped data to a SQLite table. I am trying to figure out how to prevent the…

A. Sharma
- 33
- 6
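A minimal sketch of the query-then-insert pattern, assuming a hypothetical ScrapedItem model with a url column and the usual Flask-SQLAlchemy db handle:

from scrapy.exceptions import DropItem

from myapp.models import db, ScrapedItem   # hypothetical app module and model

class DuplicatesPipeline(object):
    def process_item(self, item, spider):
        if ScrapedItem.query.filter_by(url=item['url']).first():
            raise DropItem('duplicate url: %s' % item['url'])
        db.session.add(ScrapedItem(url=item['url']))
        db.session.commit()
        return item

A UNIQUE constraint on the column plus catching IntegrityError is the race-free variant of the same idea.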
1
vote
0 answers
Make a Scrapy pipeline wait on another item in the same or a previous pipeline
My problem is as follows:
I have 3 item pipelines:
one FilesPipeline that downloads archives
one ArchiveUnpackerPipeline that unpacks the archives
one SymbolicLinkerPipeline that generates symbolic links to the contents of those archives
The issue is…

ZeeD26
- 11
- 3
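Pipelines process each item independently, so there is no built-in "wait for that other item"; one workaround is that process_item may return a Deferred, which the engine waits on before passing the item along. A rough sketch, assuming a hypothetical archive_key field relating the items, with the unpacker pipeline expected to fire the Deferred when it finishes:

from twisted.internet import defer

class SymbolicLinkerPipeline(object):
    def __init__(self):
        # archive_key -> Deferred; the unpacker pipeline is expected to call
        # pending[key].callback(None) once the archive is unpacked
        self.pending = {}

    def process_item(self, item, spider):
        d = self.pending.setdefault(item['archive_key'], defer.Deferred())
        d.addCallback(self._make_links, item)
        return d

    def _make_links(self, _, item):
        # create the symlinks here, then pass the item on
        return item

If the prerequisite never fires, the crawl hangs on the unresolved Deferred, so a timeout errback is worth adding.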
1
vote
3 answers
How to enable overwriting a file every time in Scrapy item export?
I am scraping a website which returns a list of URLs.
Example: scrapy crawl xyz_spider -o urls.csv
It works absolutely fine; now I want it to create a fresh urls.csv instead of appending data to the file. Is there any parameter I can pass to make it…

Nikhil Parmar
- 876
- 2
- 11
- 27
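Recent Scrapy releases (2.4+) accept scrapy crawl xyz_spider -O urls.csv, where the capital -O overwrites instead of appending. On older versions, a small exporter pipeline that opens the file in write mode has the same effect:

from scrapy.exporters import CsvItemExporter

class OverwriteCsvPipeline(object):
    def open_spider(self, spider):
        self.file = open('urls.csv', 'wb')   # 'wb' truncates the file each run
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item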
1
vote
0 answers
Scrapy: How to send data to the pipeline from a custom filter without downloading
To catch all redirection paths, including when the final URL was already crawled, I wrote a custom duplicate filter:
import logging
from scrapy.dupefilters import RFPDupeFilter
from seoscraper.items import RedirectionItem
class…

Antoine Brunel
- 1,065
- 2
- 14
- 30
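Dupefilters sit outside the item flow, so they cannot feed a pipeline directly; one pragmatic workaround is to collect the skipped URLs on the filter itself and flush them when the crawl closes. A sketch (the output path is an assumption):

import json

from scrapy.dupefilters import RFPDupeFilter

class RedirectAwareDupeFilter(RFPDupeFilter):
    def __init__(self, path=None, debug=False):
        super(RedirectAwareDupeFilter, self).__init__(path, debug)
        self.skipped = []

    def request_seen(self, request):
        seen = super(RedirectAwareDupeFilter, self).request_seen(request)
        if seen:
            self.skipped.append(request.url)   # remember every filtered URL
        return seen

    def close(self, reason):
        with open('skipped_urls.json', 'w') as f:   # assumed output path
            json.dump(self.skipped, f)
        return super(RedirectAwareDupeFilter, self).close(reason)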
1
vote
1 answer
Scrapy: Changing media pipeline download priorities: how to delay media file downloads until the very end of the crawl?
http://doc.scrapy.org/en/latest/topics/media-pipeline.html
When the item reaches the FilesPipeline, the URLs in the file_urls field are scheduled for download using the standard Scrapy scheduler and downloader (which means the scheduler and…

Antoine Brunel
- 1,065
- 2
- 14
- 30
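If your Scrapy version really routes media requests through the scheduler, as the quoted docs state, then lowering their priority in a FilesPipeline subclass should push them behind ordinary page requests. A hedged sketch; note that some versions dispatch these requests straight to the downloader, in which case priority has no effect:

from scrapy import Request
from scrapy.pipelines.files import FilesPipeline

class LowPriorityFilesPipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        # very low priority, so page requests are dequeued first
        return [Request(url, priority=-1000)
                for url in item.get('file_urls', [])]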
1
vote
0 answers
Scrapinghub MySQL pipeline
I'm trying to create a Scrapy pipeline that exports the scraped data to a MySQL database. I've written my script (pipeline.py):
from datetime import datetime
from hashlib import md5
from scrapy import log
from scrapy.exceptions import DropItem
from…

NickT
- 31
- 1
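Those imports match the classic asynchronous MySQL pipeline built on twisted.enterprise.adbapi. A condensed sketch (credentials, table, and columns are placeholders):

from twisted.enterprise import adbapi

class MySQLStorePipeline(object):
    def __init__(self):
        # placeholder credentials; 'MySQLdb' could be swapped for 'pymysql'
        self.dbpool = adbapi.ConnectionPool('MySQLdb', db='scraping',
                                            user='root', passwd='secret',
                                            charset='utf8', use_unicode=True)

    def process_item(self, item, spider):
        d = self.dbpool.runInteraction(self._do_insert, item)
        d.addErrback(self._handle_error, spider)
        d.addBoth(lambda _: item)   # always pass the item down the chain
        return d

    def _do_insert(self, tx, item):
        tx.execute("INSERT INTO items (guid, url) VALUES (%s, %s)",
                   (item['guid'], item['url']))

    def _handle_error(self, failure, spider):
        spider.logger.error(failure)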
1
vote
1 answer
Feed RethinkDB with Scrapy
I'm looking for a simple tutorial explaining how to write items to RethinkDB from Scrapy. The equivalent can be found for MongoDB here.

crocefisso
- 793
- 2
- 14
- 29
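Transplanting the MongoDB pipeline pattern from the Scrapy docs onto the rethinkdb driver is straightforward. A sketch, assuming a driver version where import rethinkdb as r still works (2.4+ uses from rethinkdb import RethinkDB) and placeholder database/table names:

import rethinkdb as r

class RethinkDBPipeline(object):
    def open_spider(self, spider):
        self.conn = r.connect(host='localhost', port=28015, db='scraping')

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        r.table('items').insert(dict(item)).run(self.conn)
        return item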
1
vote
1 answer
scrapyd: multiple spiders writing items to the same file
I have a scrapyd server with several spiders running at the same time; I start the spiders one by one using the schedule.json endpoint. All spiders write their contents to a common file using a pipeline
class JsonWriterPipeline(object):
    def __init__(self,…

silvestrelosada
- 55
- 1
- 10
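Each scrapyd job runs in its own process, so several spiders appending to one file will interleave their writes. The simplest cure is one file per spider (or per job), derived from spider.name:

import json

class JsonWriterPipeline(object):
    def open_spider(self, spider):
        # one output file per spider avoids interleaved writes across jobs
        self.file = open('items_%s.jl' % spider.name, 'a')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + '\n')
        return item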
1
vote
1 answer
How do I get a plain URL from Redis rather than one converted through cPickle?
I use scrapy-redis to build a simple distributed crawler. The slave machine needs to read URLs from the master's URL queue, but the problem is that the URLs the slave machine receives are cPickle-serialized data. I want the URL I get from the redis-url-queue to be…

rowele
- 85
- 10
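scrapy-redis pickles whole Request objects into its scheduler queue, which is what the slave sees there; plain URLs are meant to flow through the separate <spider>:start_urls list, which a RedisSpider consumes as text. A sketch:

from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = 'myspider'
    redis_key = 'myspider:start_urls'   # plain, human-readable URLs live here

    def parse(self, response):
        yield {'url': response.url}

# feed it from any machine with:
#   redis-cli lpush myspider:start_urls http://example.com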