Questions tagged [scrapy-pipeline]

218 questions
1
vote
1 answer

Scrapy Pipeline to update mysql for each start_url

I have a spider that reads the start_urls from a MySQL database and scrapes an unknown number of links from each page. I want to use pipelines.py to update the database with the scraped links, but I don't know how to get the start_url back into the…
SDailey
  • 17
  • 3
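A common way to keep the originating start_url attached to each scraped link is to carry it in `request.meta` and copy it into the item, so the pipeline sees both. The sketch below shows only the pipeline half with the grouping logic testable in memory; the MySQL UPDATE and the spider side are indicated in comments, and all names (`start_url`, `link`, the SQL) are illustrative assumptions, not the asker's actual schema.

```python
# Spider side (sketch): tag each request with the start_url it came from.
#
#   def start_requests(self):
#       for url in self.urls_from_db:
#           yield scrapy.Request(url, meta={"start_url": url})
#
#   def parse(self, response):
#       for href in response.css("a::attr(href)").getall():
#           yield {"start_url": response.meta["start_url"], "link": href}

class MySQLLinkPipeline:
    """Collects scraped links keyed by the start_url they were found on.

    Real code would execute an UPDATE against MySQL in process_item; here
    the links are kept in a dict so the grouping logic is testable without
    a database connection.
    """

    def open_spider(self, spider):
        self.links_by_start_url = {}
        # self.conn = MySQLdb.connect(...)  # real code would connect here

    def process_item(self, item, spider):
        self.links_by_start_url.setdefault(item["start_url"], []).append(item["link"])
        # self.cursor.execute("UPDATE pages SET links = %s WHERE url = %s", ...)
        return item
```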
1
vote
4 answers

scrapy csvpipeline to export csv according to spiders name or id

I have two different spiders running. I want to write two different CSV files, each named after the spider: spider1.csv for data from spider1 and spider2.csv for data from spider2. Here's my CsvPipeline class: class CsvPipeline(object): def…
CodeNinja101
  • 1,091
  • 4
  • 11
  • 19
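Since every pipeline method receives the running spider, the file name can simply be derived from `spider.name`. A minimal sketch (item field names are illustrative):

```python
import csv

class CsvPipeline:
    """Writes each spider's items to '<spider.name>.csv', so spider1 and
    spider2 each get their own output file from the same pipeline class."""

    def open_spider(self, spider):
        self.file = open(f"{spider.name}.csv", "w", newline="")
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        self.writer.writerow(item.values())
        return item

    def close_spider(self, spider):
        self.file.close()
```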
1
vote
1 answer

How to stop concurrency, or how to issue requests one by one in Scrapy?

I am trying to crawl product data in this order: 1) ADD CART 2) VIEW CART 3) REMOVE CART. For a single-color product it works perfectly, but for a multi-color product Scrapy issues the requests concurrently, so the above process is not in order for each and every…
Vimal Annamalai
  • 139
  • 1
  • 2
  • 12
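Scrapy is asynchronous rather than multithreaded, but the effect is the same: requests are not processed strictly in order. One blunt fix is to limit concurrency in the project settings; a stricter fix is to chain requests, yielding step 2 only from the callback of step 1. A sketch of the settings approach (values are illustrative):

```python
# settings.py (sketch) — process one request at a time
CONCURRENT_REQUESTS = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Alternatively, chain the steps so each callback yields the next request:
#   def parse_add_cart(self, response):
#       yield scrapy.Request(view_cart_url, callback=self.parse_view_cart)
#   def parse_view_cart(self, response):
#       yield scrapy.Request(remove_cart_url, callback=self.parse_remove_cart)
```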
1
vote
1 answer

Scrapy Connect Different Items for Yield

I scrape a news site. Each news article has content and many comments. I have two Items: one for the content, and another for the multiple comments. The problem is that the content and the comments are yielded from different requests. I want a news article's content and its multiple…
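The usual way to combine data from two requests into one item is to pass the partially-filled item through `request.meta` and complete it in the second callback. The sketch below simulates the two callbacks as plain functions so the hand-off logic is testable; all names are illustrative, and the real Scrapy wiring is shown in comments.

```python
def parse_article(content, comments_url):
    """First callback: start a combined item and hand it to the follow-up
    request. In Scrapy this would be:
        yield scrapy.Request(comments_url, callback=self.parse_comments,
                             meta={"item": item})
    """
    item = {"content": content, "comments": []}
    return item, comments_url

def parse_comments(item, scraped_comments):
    """Second callback: pull the item back out of response.meta and
    complete it, then yield it once with content and all comments."""
    item["comments"].extend(scraped_comments)
    return item
```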
1
vote
1 answer

Scrapy: How to clean the response?

Here is my code snippet. I am trying to scrape a website using Scrapy and then store the data in Elasticsearch for indexing. def parse(self, response): for news in response.xpath('head'): yield { 'pagetype':…
Slyper
  • 896
  • 2
  • 15
  • 32
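Scraped text usually needs its whitespace normalized before being indexed. A minimal cleaning helper that could be applied to each field before yielding the item (a sketch, not the asker's code):

```python
import re

def clean(text):
    """Collapse runs of whitespace (newlines, tabs, multiple spaces) into
    single spaces and strip the ends; returns '' for missing values."""
    if text is None:
        return ""
    return re.sub(r"\s+", " ", text).strip()
```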
1
vote
0 answers

Reading existing Django models from inside Scrapy spider

I am working on a project where urls are put into a Django model called UrlItems. The models.py file containing UrlItems is located in the home app. I typed scrapy startproject scraper in the same directory as the models.py file. Please see this…
kas
  • 857
  • 1
  • 15
  • 21
1
vote
0 answers

Flask-SQLAlchemy: checking for duplicates in the database

Hi, I am using the Python Scrapy library to create spiders and extract data from websites. In my pipeline I use Flask-SQLAlchemy to configure the spider so that it adds the scraped data to a SQLite table. I am trying to figure out how to prevent the…
A. Sharma
  • 33
  • 6
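The standard pattern is to query for the unique field before inserting and drop the item if it already exists. The sketch below uses the stdlib `sqlite3` module so it runs standalone; the asker's Flask-SQLAlchemy version would use `Model.query.filter_by(url=...).first()` for the existence check instead. Table and field names are illustrative.

```python
import sqlite3

class DedupPipeline:
    """Drops items whose 'url' is already in the table; otherwise inserts.

    Real Scrapy code would `raise DropItem(...)` instead of returning None.
    """

    def open_spider(self, spider):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE IF NOT EXISTS items (url TEXT PRIMARY KEY)")

    def process_item(self, item, spider):
        exists = self.conn.execute(
            "SELECT 1 FROM items WHERE url = ?", (item["url"],)).fetchone()
        if exists:
            return None  # duplicate: drop it
        self.conn.execute("INSERT INTO items (url) VALUES (?)", (item["url"],))
        return item
```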
1
vote
0 answers

Make scrapy pipeline wait on another item in the same or previous pipeline

My problem is as follows: I have three item pipelines: a FilesPipeline that downloads archives, an ArchiveUnpackerPipeline that unpacks each archive, and a SymbolicLinkerPipeline that generates symbolic links to the contents of those archives. The issue is…
ZeeD26
  • 11
  • 3
1
vote
3 answers

How to enable overwriting a file every time in Scrapy item export?

I am scraping a website which returns a list of URLs. Example: scrapy crawl xyz_spider -o urls.csv. It works absolutely fine, but I want each run to create a fresh urls.csv rather than append data to the file. Is there any parameter I can pass to make it…
Nikhil Parmar
  • 876
  • 2
  • 11
  • 27
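With `-o`, Scrapy appends to an existing file; newer Scrapy versions (2.x) accept `-O urls.csv` to overwrite instead. Where upgrading isn't an option, a pipeline that opens the file in `"w"` mode at spider start gives the same effect. A sketch (the filename and the `url` field are illustrative):

```python
import csv

class OverwritingCsvPipeline:
    """Opens urls.csv in 'w' mode when the spider starts, so every run
    replaces the file instead of appending to it."""

    def open_spider(self, spider):
        self.file = open("urls.csv", "w", newline="")
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        self.writer.writerow([item["url"]])
        return item

    def close_spider(self, spider):
        self.file.close()
```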
1
vote
0 answers

Scrapy: How to send data to the pipeline from a custom filter without downloading

To catch all redirection paths, including when the final url was already crawled, I wrote a custom duplicate filter: import logging from scrapy.dupefilters import RFPDupeFilter from seoscraper.items import RedirectionItem class…
Antoine Brunel
  • 1,065
  • 2
  • 14
  • 30
1
vote
1 answer

Scrapy: Changing media pipeline download priorities: How to delay media files downloads at the very end of the crawl?

http://doc.scrapy.org/en/latest/topics/media-pipeline.html When the item reaches the FilesPipeline, the URLs in the file_urls field are scheduled for download using the standard Scrapy scheduler and downloader (which means the scheduler and…
Antoine Brunel
  • 1,065
  • 2
  • 14
  • 30
1
vote
0 answers

Scrapinghub MySQL Pipeline

I'm trying to create a Scrapy pipeline that exports the scraped data to a MySQL database. I've written my script (pipeline.py): from datetime import datetime from hashlib import md5 from scrapy import log from scrapy.exceptions import DropItem from…
NickT
  • 31
  • 1
1
vote
1 answer

Feed Rethinkdb with scrapy

I'm looking for a simple tutorial explaining how to write items to Rethinkdb from scrapy. The equivalent can be found for MongoDB here.
1
vote
1 answer

scrapyd multiple spiders writing items to same file

I have a scrapyd server with several spiders running at the same time; I start the spiders one by one using the schedule.json endpoint. All spiders write their contents to a common file using a pipeline: class JsonWriterPipeline(object): def __init__(self,…
silvestrelosada
  • 55
  • 1
  • 10
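When several scrapyd-launched spiders share one output file, opening it in `"w"` mode from any of them truncates the others' work; opening in append mode and writing one self-contained JSON line per item keeps the records separable. A sketch under that assumption (the shared filename is illustrative):

```python
import json

class JsonWriterPipeline:
    """Appends one JSON line per item to a shared .jl file. Each spider
    process keeps its own handle; appending whole lines and flushing per
    item keeps records from different spiders intact and parseable."""

    def open_spider(self, spider):
        self.file = open("items.jl", "a")

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + "\n")
        self.file.flush()
        return item

    def close_spider(self, spider):
        self.file.close()
```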
1
vote
1 answer

How do I get a plain URL from Redis rather than cPickle-serialized data?

I use scrapy-redis to build a simple distributed crawler; the slave machines need to read URLs from the master's queue. The problem is that what the slave machine gets from the queue is cPickle-serialized data. I want to get plain URLs from the redis URL queue, which is…
rowele
  • 85
  • 10