Questions tagged [scrapy-pipeline]
218 questions
1
vote
2 answers
Scrapy: export parsed data into multiple files
I'd like to parse pages and then export certain items to one CSV file and others to another file.
Using feed exports here, I managed to do it for one file as follows:
settings
FEED_EXPORT_FIELDS = (
'url',
'group_url',
'name',
…

rafalf
- 425
- 7
- 16
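One common shape for an answer to this question (a sketch, not the asker's code): a pipeline that keeps one CSV writer per item kind and routes each item by a key. The `type` key, the file names, and the field list below are all illustrative assumptions; `open_spider`, `process_item` and `close_spider` are the standard Scrapy pipeline hooks, but the class itself uses only the stdlib `csv` module.

```python
import csv

class MultiCSVPipeline:
    """Route items to separate CSV files based on a (hypothetical)
    'type' key on each item dict."""

    FIELDS = ['url', 'group_url', 'name']  # illustrative field list

    def open_spider(self, spider):
        self.files = {}
        self.writers = {}
        for item_type in ('page', 'group'):
            f = open('%s_items.csv' % item_type, 'w', newline='')
            self.files[item_type] = f
            # extrasaction='ignore' drops keys not in FIELDS (e.g. 'type')
            writer = csv.DictWriter(f, fieldnames=self.FIELDS,
                                    extrasaction='ignore')
            writer.writeheader()
            self.writers[item_type] = writer

    def process_item(self, item, spider):
        self.writers[item.get('type', 'page')].writerow(item)
        return item

    def close_spider(self, spider):
        for f in self.files.values():
            f.close()
```

Enable it via `ITEM_PIPELINES` in settings; missing fields are written as empty cells by `DictWriter`.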
1
vote
2 answers
Filter duplicate entries when exporting items in append mode with CSV exports in Scrapy
I am trying to figure out how to pre-check whether an item is already present in a row of the CSV file to be exported. If the item is not present, then it should be appended; otherwise it should be discarded. So far I have done the following in…

Dhiraz Gazurel
- 104
- 1
- 10
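The pre-check the asker describes can be sketched with the stdlib `csv` module alone: read the keys already in the file, then append only unseen items. Function name, key, and field names are assumptions; a real pipeline would load `seen` once in `open_spider` rather than re-reading the file per item.

```python
import csv
import os

def append_unique(path, item, key='url', fieldnames=('url', 'name')):
    """Append item to a CSV file only if its key value is not already
    present. Returns True if written, False if discarded as a duplicate.

    Sketch only: re-reads the file on every call (O(n) per item)."""
    new_file = not os.path.exists(path)
    seen = set()
    if not new_file:
        with open(path, newline='') as f:
            seen = {row[key] for row in csv.DictReader(f)}
    if item[key] in seen:
        return False  # duplicate: discard
    with open(path, 'a', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=list(fieldnames))
        if new_file:
            writer.writeheader()
        writer.writerow(item)
    return True
```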
1
vote
1 answer
Give output file name inside the crawler in Scrapy
I have a Scrapy project written in Python 3.6, and the project has 3 crawlers that simply scrape items from 3 different websites, one crawler for each website. I am using the item from items.py in the script, doing yield item; each crawler has minor differences in its items…

itsmnthn
- 1,898
- 1
- 18
- 30
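The usual way to give each crawler its own output file is per-spider `custom_settings`. A sketch, assuming Scrapy 2.1+ (the spider and file names are illustrative); older versions used `FEED_URI`/`FEED_FORMAT` in `custom_settings` instead:

```python
import scrapy

class SiteOneSpider(scrapy.Spider):
    name = 'site_one'  # illustrative name
    custom_settings = {
        # Scrapy 2.1+ per-spider feed configuration
        'FEEDS': {
            'site_one_items.csv': {'format': 'csv'},
        },
    }
```

Each of the three spiders can carry its own `FEEDS` dict, so no `-o` flag is needed on the command line.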
1
vote
0 answers
Items randomly never reach pipeline when iterating over a list
I'm fairly new to Scrapy. I'm crawling a site that offers a list of websites, and I want to crawl certain pages of those websites.
Example: siteX.com has the following list:
http://siteA.com
http://siteB.com
http://siteC.com
All those pages have…

user3255061
- 1,757
- 1
- 30
- 50
1
vote
1 answer
Pipeline to remove None values
My spider yields certain data but sometimes it doesn't find the data.
Instead of setting a condition such as below:
if response.xpath('//div[@id="mitten"]//h1/text()').extract_first():
result['name'] =…

Casper
- 1,435
- 10
- 22
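The condition the asker wants to avoid repeating per field can live in a single pipeline instead. A minimal sketch, assuming dict-like items:

```python
class DropNonePipeline:
    """Strip every key whose value is None before the item reaches
    any exporter. Works for plain dict items."""

    def process_item(self, item, spider):
        return {k: v for k, v in item.items() if v is not None}
```

Register it in `ITEM_PIPELINES`; fields that `extract_first()` returned `None` for then simply disappear from the output.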
1
vote
2 answers
How To Keep/Export Field Items in Specific Order Per Spider Class Definition, Utilizing The Items Pipeline in Scrapy
I have a spider which exports data to different CSV files (named per the class definitions in the spider class). However, I also wanted to keep the fields in a specific order as they were being processed and exported…

NeilR
- 46
- 7
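For CSV feeds, column order follows `FEED_EXPORT_FIELDS`, and it can be set per spider class. A sketch (spider name and fields are illustrative):

```python
import scrapy

class ProductsSpider(scrapy.Spider):
    name = 'products'  # illustrative name
    custom_settings = {
        # CSV columns are emitted in exactly this order
        'FEED_EXPORT_FIELDS': ['name', 'price', 'url'],
    }
```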
1
vote
0 answers
Test or Mock Scrapy Pipeline
I was looking into testing the Scrapy pipeline (I already know the spider works) when it occurred to me that I could just use a local copy of a page from the target website instead of repeatedly hitting it with my spider online. But I did not see…

Malik A. Rumi
- 1,855
- 4
- 25
- 36
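Pipelines are plain Python classes, so their hooks can be unit-tested without running a crawl at all: instantiate the pipeline and call `process_item` with a stand-in spider. The pipeline below is hypothetical; a real Scrapy pipeline would raise `scrapy.exceptions.DropItem` where this sketch raises `ValueError` to stay dependency-free.

```python
import types

class ValidatePipeline:
    """Hypothetical pipeline under test: rejects items without a title."""

    def process_item(self, item, spider):
        if not item.get('title'):
            raise ValueError('missing title')  # DropItem in real Scrapy code
        return item

def test_process_item():
    pipeline = ValidatePipeline()
    # No crawl needed: any object with the attributes the pipeline
    # touches can stand in for the spider
    fake_spider = types.SimpleNamespace(name='fake')
    assert pipeline.process_item({'title': 'Hi'}, fake_spider) == {'title': 'Hi'}
```

The same pattern works with a locally saved HTML page: build a fake `Response` around the file, feed it to the parse method, and pass the yielded items through the pipeline by hand.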
1
vote
1 answer
Invoke scrapy's custom exporter by command line
While trying to resolve my problem (outputting a JSON array ordered by a specific item field), I received an answer suggesting that I create a custom exporter for the job.
I'm creating one, but... all the examples that I've found suggest calling…

Lore
- 1,286
- 1
- 22
- 57
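A custom exporter can be registered under a format name via `FEED_EXPORTERS` and then selected from the command line; the module path and format name below are hypothetical. The `-t` flag matches Scrapy versions contemporary with this question; check `scrapy crawl --help` on newer releases.

```python
# settings.py (sketch)
FEED_EXPORTERS = {
    'sortedjson': 'myproject.exporters.SortedJsonItemExporter',
}
# The registered format name can then be selected on the command line:
#   scrapy crawl myspider -o items.json -t sortedjson
```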
1
vote
1 answer
How to fetch data using scrapy?
I am working on a Django project and I want to provide some news feeds on the home page. I recently started working with Scrapy; when I run the given code with "scrapy shell", it fetches the data successfully. But when I put this code into…

jax
- 3,927
- 7
- 41
- 70
1
vote
0 answers
Using Scrapy JsonItemsLinesExporter, returns no value
I have multiple spiders within one scraping program; I am trying to run all spiders simultaneously from a script and then dump the contents to a JSON file. When I use the shell on each individual spider and do -o xyz.json, it works fine.
I've…

Artie
- 82
- 1
- 8
1
vote
2 answers
Access Instance of scrapy pipeline class
I want to access the variable self.cursor to make use of the active PostgreSQL connection, but I am unable to figure out how to access Scrapy's instance of the pipeline class.
class ScrapenewsPipeline(object):
def open_spider(self, spider):
…

atb00ker
- 957
- 13
- 24
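One common answer to this question: since Scrapy passes the spider into every pipeline hook, the pipeline can hand the spider a reference to itself in `open_spider`, and the spider can then reach `self.cursor` through that reference. A sketch; the attribute name is arbitrary and `object()` stands in for a real psycopg2 cursor.

```python
class ScrapenewsPipeline:
    """Expose the pipeline instance (and its DB cursor) to the spider."""

    def open_spider(self, spider):
        self.cursor = object()  # stand-in for a real psycopg2 cursor
        # Spider code can now use spider.news_pipeline.cursor
        spider.news_pipeline = self

    def process_item(self, item, spider):
        return item
```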
1
vote
0 answers
Scrapy: manipulate same item multiple times on different functions before yield-ing
I have a spider that scrapes data from a webpage and writes the title, text and image URL to MongoDB.
I have two functions:
def parse_news(self, response):
item = NewsItem()
item['_id'] = .. #key for MongoDB - Unique
item['Title'] = ..
…

endritius
- 11
- 3
1
vote
1 answer
How can I determine whether Scrapy encountered errors, in the Pipeline.close_spider() method?
I have a Scrapy spider and Pipeline setup.
My Spider extracts data from a website and my Pipeline's process_item() method inserts the extracted data into a temporary database table.
At the end, in the Pipeline's close_spider() method I run some error…

buzz
- 326
- 3
- 7
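One way to answer this: pipelines can reach the crawl's stats collector through `spider.crawler.stats`, and `log_count/ERROR` is the counter Scrapy's logging integration records for ERROR-level messages (verify the key against your own stats dump). A sketch with the commit/rollback reduced to a flag:

```python
class DbCheckPipeline:
    """Decide commit vs rollback in close_spider by checking the
    stats collector for logged errors."""

    def close_spider(self, spider):
        errors = spider.crawler.stats.get_value('log_count/ERROR', 0) or 0
        # rollback keeps the temporary table around for inspection
        self.action = 'rollback' if errors else 'commit'
```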
1
vote
3 answers
Unable to pass empty url through scrapy pipeline
I have a list of data objects each of them containing a url to be scraped. Some of these urls are not valid but I still want the data object to fall through to reach item pipelines.
After @tomáš-linhart's reply I understood that using a middleware…

comiventor
- 3,922
- 5
- 50
- 77
1
vote
1 answer
Scrapy - condition based crawling
I have the following Scrapy parse method:
def parse(self, response):
item_loader = ItemLoader(item=MyItem(), response=response)
for url in response.xpath('//img/@src').extract():
item_loader.add_value('image_urls',…

Akustik
- 43
- 8