Questions tagged [scrapy-pipeline]

218 questions
1
vote
2 answers

Scrapy: export parsed data into multiple files

I'd like to parse pages and then export certain items to one CSV file and others to another file. Using feed exports, I managed to do it for one file as follows: in settings, FEED_EXPORT_FIELDS = ( 'url', 'group_url', 'name', …
rafalf
  • 425
  • 7
  • 16
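
One common way to split output across files (a minimal sketch; the item classes GroupItem and MemberItem and the file names are hypothetical) is a pipeline that routes each item to its own CsvItemExporter instead of relying on a single feed export:

    from scrapy.exporters import CsvItemExporter

    from myproject.items import GroupItem, MemberItem  # hypothetical item classes

    class MultiCsvPipeline:
        def open_spider(self, spider):
            # One binary file handle and one exporter per item class.
            self.files = {
                GroupItem: open('groups.csv', 'wb'),
                MemberItem: open('members.csv', 'wb'),
            }
            self.exporters = {cls: CsvItemExporter(f) for cls, f in self.files.items()}
            for exporter in self.exporters.values():
                exporter.start_exporting()

        def close_spider(self, spider):
            for exporter in self.exporters.values():
                exporter.finish_exporting()
            for f in self.files.values():
                f.close()

        def process_item(self, item, spider):
            exporter = self.exporters.get(type(item))
            if exporter is not None:
                exporter.export_item(item)
            return item
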
1
vote
2 answers

Filter duplicate entries when exporting items with append mode in CSV exports in Scrapy

I am trying to figure out how to pre-check whether an item is already present in a row of the CSV file to be exported. If the item is not present, then it needs to be appended; otherwise it should be discarded. So far I have done the following in…
Dhiraz Gazurel
  • 104
  • 1
  • 10
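
A hedged sketch of the pre-check idea: load the key column of the existing CSV into a set when the spider opens, then drop any item whose key is already present (the file name output.csv and the key field 'url' are assumptions):

    import csv
    import os

    from scrapy.exceptions import DropItem

    class CsvDedupePipeline:
        def open_spider(self, spider):
            # Remember every key already written in previous append-mode runs.
            self.seen = set()
            if os.path.exists('output.csv'):
                with open('output.csv', newline='') as f:
                    for row in csv.DictReader(f):
                        self.seen.add(row.get('url'))

        def process_item(self, item, spider):
            key = item.get('url')
            if key in self.seen:
                raise DropItem(f"Duplicate entry: {key}")
            self.seen.add(key)
            return item
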
1
vote
1 answer

Give output file name inside the crawler in Scrapy

I have a Scrapy project written in Python 3.6, and the project has 3 crawlers that simply scrape items from 3 different websites, one crawler for each website. I am using an item from items.py in the script and doing yield item; each crawler has minor differences in items…
itsmnthn
  • 1,898
  • 1
  • 18
  • 30
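
If each spider should name its own output file, one approach (Scrapy 2.1+; the spider name and path are hypothetical) is a per-spider FEEDS entry in custom_settings, where %(name)s expands to the spider's name:

    import scrapy

    class SiteOneSpider(scrapy.Spider):
        name = 'site_one'
        # Each crawler gets its own file without touching the project
        # settings; older Scrapy versions use FEED_URI / FEED_FORMAT instead.
        custom_settings = {
            'FEEDS': {
                'output/%(name)s.csv': {'format': 'csv'},
            },
        }
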
1
vote
0 answers

Items randomly never reach pipeline when iterating over a list

I'm fairly new to Scrapy. I'm crawling a site that offers a list of websites, and I want to crawl certain pages of those websites. Example: siteX.com has the following list: http://siteA.com, http://siteB.com, http://siteC.com. All those pages have…
user3255061
  • 1,757
  • 1
  • 30
  • 50
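
A frequent cause of items silently never reaching the pipeline is shared mutable state between concurrent callbacks (for example, one item instance passed to several requests). A minimal sketch of a safer pattern, building a fresh item inside each callback (the selectors and URLs are hypothetical):

    import scrapy

    class SiteListSpider(scrapy.Spider):
        name = 'site_list'
        start_urls = ['http://siteX.com']  # hypothetical listing page

        def parse(self, response):
            for url in response.xpath('//a/@href').getall():
                # One independent request per site; cb_kwargs carries plain
                # values rather than a shared item object.
                yield scrapy.Request(response.urljoin(url),
                                     callback=self.parse_site,
                                     cb_kwargs={'source': url})

        def parse_site(self, response, source):
            # A new dict per response, so concurrent callbacks never
            # overwrite each other's fields.
            yield {'source': source, 'title': response.css('title::text').get()}
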
1
vote
1 answer

Pipeline to remove None values

My spider yields certain data, but sometimes it doesn't find it. Instead of setting a condition such as the one below: if response.xpath('//div[@id="mitten"]//h1/text()').extract_first(): result['name'] =…
Casper
  • 1,435
  • 10
  • 22
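
A minimal sketch of such a pipeline: strip every field whose value is None before the item reaches the exporter (this assumes dict-like items, which also covers scrapy.Item classes since they implement the mapping interface):

    class DropNoneFieldsPipeline:
        def process_item(self, item, spider):
            # Returning a filtered dict is enough: pipelines may return any
            # item-like object, and exporters skip fields that are absent.
            return {k: v for k, v in item.items() if v is not None}
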
1
vote
2 answers

How To Keep/Export Field Items in Specific Order Per Spider Class Definition, Utilizing The Items Pipeline in Scrapy

I have a spider which exports data to different CSV files (named per the class definitions in the spider class). However, I also wanted to keep the fields in a specific order as they were being processed and exported…
NeilR
  • 46
  • 7
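
One way to pin the column order (a sketch; the field names are hypothetical) is a per-spider FEED_EXPORT_FIELDS in custom_settings, which CsvItemExporter uses as the exact column order:

    import scrapy

    class OrderedSpider(scrapy.Spider):
        name = 'ordered'
        # Without FEED_EXPORT_FIELDS the CSV column order is arbitrary.
        custom_settings = {
            'FEED_EXPORT_FIELDS': ['url', 'group_url', 'name'],  # hypothetical fields
        }
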
1
vote
0 answers

Test or Mock Scrapy Pipeline

I was looking into testing the Scrapy pipeline (I already know the spider works) when it occurred to me that I could just use a local copy of a page from the target website instead of repeatedly hitting it online with my spider. But I did not see…
Malik A. Rumi
  • 1,855
  • 4
  • 25
  • 36
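
Pipelines are plain classes, so they can be unit-tested by calling process_item directly, and a saved page can be wrapped in an offline response for spider tests. A sketch under those assumptions (MyPipeline and the file path are hypothetical):

    import unittest

    from scrapy.http import HtmlResponse, Request

    from myproject.pipelines import MyPipeline  # hypothetical pipeline

    class PipelineTest(unittest.TestCase):
        def test_process_item(self):
            pipeline = MyPipeline()
            # Most pipelines only need an item and something spider-shaped.
            result = pipeline.process_item({'name': 'example'}, spider=None)
            self.assertEqual(result['name'], 'example')

    def fake_response_from_file(path, url='http://example.com'):
        # Build an HtmlResponse from a local copy of the target page so the
        # spider's parse() can be exercised without hitting the network.
        with open(path, 'rb') as f:
            body = f.read()
        return HtmlResponse(url=url, request=Request(url=url),
                            body=body, encoding='utf-8')
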
1
vote
1 answer

Invoke a custom Scrapy exporter from the command line

While trying to resolve my problem (outputting a JSON array ordered by a specific item field), I received an answer suggesting that I create a custom exporter for the job. I'm creating one, but... all the examples that I've found suggest calling…
Lore
  • 1,286
  • 1
  • 22
  • 57
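
Rather than calling the exporter manually, it can be registered under a feed format name via the FEED_EXPORTERS setting and then selected on the command line. A sketch (the exporter class path and format name are hypothetical):

    # settings.py
    FEED_EXPORTERS = {
        'ordered_json': 'myproject.exporters.OrderedJsonExporter',
    }

With that in place, older Scrapy versions select it with scrapy crawl myspider -o items.json -t ordered_json, and newer ones with scrapy crawl myspider -o items.json:ordered_json.
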
1
vote
1 answer

How to fetch data using Scrapy?

I am working on a Django project and I want to provide some news feeds on the home page. I recently started working with Scrapy; when I run the given code in "scrapy shell", it fetches the data successfully. But when I put this code into…
jax
  • 3,927
  • 7
  • 41
  • 70
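
The usual reason code works in scrapy shell but not inside another application is that a spider needs Scrapy's crawl engine around it. A minimal sketch of running a spider from Django-side code with CrawlerProcess (NewsSpider and the module path are hypothetical):

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    from myproject.spiders.news import NewsSpider  # hypothetical spider

    def run_crawl():
        # CrawlerProcess starts Twisted's reactor, which can run only once
        # per process; in a long-lived Django server, run this in a separate
        # process (management command, cron job, or task queue) instead.
        process = CrawlerProcess(get_project_settings())
        process.crawl(NewsSpider)
        process.start()  # blocks until the crawl finishes
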
1
vote
0 answers

Using Scrapy JsonItemsLinesExporter returns no value

I have multiple spiders within one scraping program. I am trying to run all the spiders simultaneously from a script and then dump the contents to a JSON file. When I run each individual spider from the shell with -o xyz.json, it works fine. I've…
Artie
  • 82
  • 1
  • 8
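
The script equivalent of -o xyz.jl is a feed configured in the settings handed to CrawlerProcess. A sketch, assuming Scrapy 2.1+ FEEDS syntax and hypothetical spider classes:

    from scrapy.crawler import CrawlerProcess

    from myproject.spiders import SpiderOne, SpiderTwo  # hypothetical spiders

    process = CrawlerProcess(settings={
        # 'jsonlines' writes one JSON object per line, like the CLI's .jl output.
        'FEEDS': {'xyz.jl': {'format': 'jsonlines'}},
    })
    process.crawl(SpiderOne)
    process.crawl(SpiderTwo)
    process.start()  # runs both spiders, then returns
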
1
vote
2 answers

Access instance of Scrapy pipeline class

I want to access the variable self.cursor to make use of the active PostgreSQL connection, but I am unable to figure out how to access Scrapy's instance of the pipeline class. class ScrapenewsPipeline(object): def open_spider(self, spider): …
atb00ker
  • 957
  • 13
  • 24
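
Scrapy does not expose pipeline instances through an official API, but a common workaround is to have the pipeline hand a reference to itself to the spider in open_spider. A hedged sketch (the attribute name db_pipeline is hypothetical):

    class ScrapenewsPipeline(object):
        def open_spider(self, spider):
            self.cursor = ...  # existing PostgreSQL cursor setup
            # Expose this pipeline instance so spider code can reach the
            # live connection as self.db_pipeline.cursor.
            spider.db_pipeline = self

        def process_item(self, item, spider):
            return item
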
1
vote
0 answers

Scrapy: manipulate the same item multiple times in different functions before yielding

I have a spider that scrapes data from a webpage and writes the title, text, and image URL to MongoDB. I have two functions: def parse_news(self, response): item = NewsItem() item['_id'] = .. #key for MongoDB - Unique item['Title'] = .. …
endritius
  • 11
  • 3
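
A common pattern for this is to pass the half-built item from callback to callback and yield it only once, at the end of the chain. A sketch using cb_kwargs (the follow-up URL and selectors are hypothetical):

    import scrapy

    from myproject.items import NewsItem  # hypothetical item class

    class NewsSpider(scrapy.Spider):
        name = 'news'

        def parse_news(self, response):
            item = NewsItem()
            item['Title'] = response.css('h1::text').get()
            # Hand the unfinished item to the next callback instead of
            # yielding it now.
            yield scrapy.Request(response.urljoin('/image-page'),
                                 callback=self.parse_image,
                                 cb_kwargs={'item': item})

        def parse_image(self, response, item):
            item['Image'] = response.css('img::attr(src)').get()
            yield item  # only the last callback yields, so MongoDB sees one document
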
1
vote
1 answer

How can I determine whether Scrapy encountered errors, in the Pipeline.close_spider() method?

I have a Scrapy spider and Pipeline setup. My Spider extracts data from a website and my Pipeline's process_item() method inserts the extracted data into a temporary database table. At the end, in the Pipeline's close_spider() method, I run some error…
buzz
  • 326
  • 3
  • 7
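
One way to detect this (a sketch, not the only option) is to keep a handle on the crawler's stats collector via from_crawler and check the error counter when the spider closes; the log_count/ERROR stat is incremented for every ERROR-level log record:

    class ErrorAwarePipeline:
        @classmethod
        def from_crawler(cls, crawler):
            pipeline = cls()
            pipeline.stats = crawler.stats  # stats collector for this crawl
            return pipeline

        def process_item(self, item, spider):
            return item

        def close_spider(self, spider):
            errors = self.stats.get_value('log_count/ERROR', 0)
            if errors:
                # Hypothetical handling: skip the final step when errors occurred.
                spider.logger.warning('%d errors during crawl; skipping merge', errors)
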
1
vote
3 answers

Unable to pass empty URL through Scrapy pipeline

I have a list of data objects, each of them containing a URL to be scraped. Some of these URLs are not valid, but I still want the data object to fall through and reach the item pipelines. After @tomáš-linhart's reply, I understood that using a middleware…
comiventor
  • 3,922
  • 5
  • 50
  • 77
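
Since only items that a callback yields enter the pipelines, one sketch of a workaround is to branch in the spider itself: request the valid URLs and yield the bare data object directly when the URL is empty (the listing endpoint and object shape are hypothetical; response.json() needs Scrapy 2.2+):

    import scrapy

    class ObjectsSpider(scrapy.Spider):
        name = 'objects'
        start_urls = ['http://example.com/objects']  # hypothetical endpoint

        def parse(self, response):
            for obj in response.json():  # hypothetical list of data objects
                url = obj.get('url')
                if url:
                    yield scrapy.Request(url, callback=self.parse_obj,
                                         cb_kwargs={'obj': obj})
                else:
                    # No request to make: yield the object as an item so it
                    # still flows through the item pipelines.
                    yield obj

        def parse_obj(self, response, obj):
            yield {**obj, 'status': response.status}
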
1
vote
1 answer

Scrapy - condition-based crawling

I have the following Scrapy parse method: def parse(self, response): item_loader = ItemLoader(item=MyItem(), response=response) for url in response.xpath('//img/@src').extract(): item_loader.add_value('image_urls',…
Akustik
  • 43
  • 8
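
A sketch of one way to make the crawl conditional: yield the loaded item when the page satisfies the condition, and follow a link otherwise (the next-page selector is hypothetical):

    import scrapy
    from scrapy.loader import ItemLoader

    from myproject.items import MyItem  # hypothetical item class

    class ConditionalSpider(scrapy.Spider):
        name = 'conditional'

        def parse(self, response):
            image_urls = response.xpath('//img/@src').getall()
            if image_urls:
                loader = ItemLoader(item=MyItem(), response=response)
                loader.add_value('image_urls', image_urls)
                yield loader.load_item()
            else:
                # Condition not met: crawl onward instead of yielding an item.
                next_page = response.xpath('//a[@rel="next"]/@href').get()
                if next_page:
                    yield response.follow(next_page, callback=self.parse)
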