Questions tagged [scrapy-pipeline]

218 questions
1
vote
2 answers

Scrapy: export parsed data into multiple files

I'd like to parse pages and then export certain items to one CSV file and others to another file. Using feed exports, I managed to do it for one file as follows: in settings, FEED_EXPORT_FIELDS = ( 'url', 'group_url', 'name', …
rafalf
  • 425
  • 7
  • 16
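
One common way to split output across files (a minimal sketch; the item classes GroupItem and MemberItem and the file names are hypothetical) is a pipeline that routes each item to its own CsvItemExporter instead of relying on a single feed export:

    from scrapy.exporters import CsvItemExporter

    from myproject.items import GroupItem, MemberItem  # hypothetical item classes

    class MultiCsvPipeline:
        def open_spider(self, spider):
            # One binary file handle and one exporter per item class.
            self.files = {
                GroupItem: open('groups.csv', 'wb'),
                MemberItem: open('members.csv', 'wb'),
            }
            self.exporters = {cls: CsvItemExporter(f) for cls, f in self.files.items()}
            for exporter in self.exporters.values():
                exporter.start_exporting()

        def close_spider(self, spider):
            for exporter in self.exporters.values():
                exporter.finish_exporting()
            for f in self.files.values():
                f.close()

        def process_item(self, item, spider):
            exporter = self.exporters.get(type(item))
            if exporter is not None:
                exporter.export_item(item)
            return item
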
1
vote
2 answers

Filter duplicate entries when exporting items with append mode in CSV exports in Scrapy

I am trying to figure out how to pre-check whether an item is already present in a row of the CSV file to be exported. If the item is not present, then it needs to be appended; otherwise it should be discarded. So far I have done the following in…
Dhiraz Gazurel
  • 104
  • 1
  • 10
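
A hedged sketch of the pre-check idea: load the key column of the existing CSV into a set when the spider opens, then drop any item whose key is already present (the file name output.csv and the key field 'url' are assumptions):

    import csv
    import os

    from scrapy.exceptions import DropItem

    class CsvDedupePipeline:
        def open_spider(self, spider):
            # Remember every key already written in previous append-mode runs.
            self.seen = set()
            if os.path.exists('output.csv'):
                with open('output.csv', newline='') as f:
                    for row in csv.DictReader(f):
                        self.seen.add(row.get('url'))

        def process_item(self, item, spider):
            key = item.get('url')
            if key in self.seen:
                raise DropItem(f"Duplicate entry: {key}")
            self.seen.add(key)
            return item
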
1
vote
1 answer

Give output file name inside the crawler in Scrapy

I have a Scrapy project written in Python 3.6, and the project has 3 crawlers that simply scrape items from 3 different websites, one crawler for each website. I am using an item from items.py in the script and doing yield item; each crawler has minor differences in items…
itsmnthn
  • 1,898
  • 1
  • 18
  • 30
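
If each spider should name its own output file, one approach (Scrapy 2.1+; the spider name and path are hypothetical) is a per-spider FEEDS entry in custom_settings, where %(name)s expands to the spider's name:

    import scrapy

    class SiteOneSpider(scrapy.Spider):
        name = 'site_one'
        # Each crawler gets its own file without touching the project
        # settings; older Scrapy versions use FEED_URI / FEED_FORMAT instead.
        custom_settings = {
            'FEEDS': {
                'output/%(name)s.csv': {'format': 'csv'},
            },
        }
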
1
vote
0 answers

Items randomly never reach pipeline when iterating over a list

I'm fairly new to Scrapy. I'm crawling a site that offers a list of websites, and I want to crawl certain pages of those websites. Example: siteX.com has the following list: http://siteA.com, http://siteB.com, http://siteC.com. All those pages have…
user3255061
  • 1,757
  • 1
  • 30
  • 50
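
A frequent cause of items silently never reaching the pipeline is shared mutable state between concurrent callbacks (for example, one item instance passed to several requests). A minimal sketch of a safer pattern, building a fresh item inside each callback (the selectors and URLs are hypothetical):

    import scrapy

    class SiteListSpider(scrapy.Spider):
        name = 'site_list'
        start_urls = ['http://siteX.com']  # hypothetical listing page

        def parse(self, response):
            for url in response.xpath('//a/@href').getall():
                # One independent request per site; cb_kwargs carries plain
                # values rather than a shared item object.
                yield scrapy.Request(response.urljoin(url),
                                     callback=self.parse_site,
                                     cb_kwargs={'source': url})

        def parse_site(self, response, source):
            # A new dict per response, so concurrent callbacks never
            # overwrite each other's fields.
            yield {'source': source, 'title': response.css('title::text').get()}
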
1
vote
1 answer

Pipeline to remove None values

My spider yields certain data, but sometimes it doesn't find it. Instead of setting a condition such as the one below: if response.xpath('//div[@id="mitten"]//h1/text()').extract_first(): result['name'] =…
Casper
  • 1,435
  • 10
  • 22
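
A minimal sketch of such a pipeline: strip every field whose value is None before the item reaches the exporter (this assumes dict-like items, which also covers scrapy.Item classes since they implement the mapping interface):

    class DropNoneFieldsPipeline:
        def process_item(self, item, spider):
            # Returning a filtered dict is enough: pipelines may return any
            # item-like object, and exporters skip fields that are absent.
            return {k: v for k, v in item.items() if v is not None}
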
1
vote
2 answers

How To Keep/Export Field Items in Specific Order Per Spider Class Definition, Utilizing The Items Pipeline in Scrapy

I have a spider which exports data to different CSV files (named per the class definitions in the spider class). However, I also wanted to keep the fields in a specific order as they were being processed and exported…
NeilR
  • 46
  • 7
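
One way to pin the column order (a sketch; the field names are hypothetical) is a per-spider FEED_EXPORT_FIELDS in custom_settings, which CsvItemExporter uses as the exact column order:

    import scrapy

    class OrderedSpider(scrapy.Spider):
        name = 'ordered'
        # Without FEED_EXPORT_FIELDS the CSV column order is arbitrary.
        custom_settings = {
            'FEED_EXPORT_FIELDS': ['url', 'group_url', 'name'],  # hypothetical fields
        }
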
1
vote
0 answers

Test or Mock Scrapy Pipeline

I was looking into testing the Scrapy pipeline (I already know the spider works) when it occurred to me that I could just use a local copy of a page from the target website instead of repeatedly hitting it online with my spider. But I did not see…
Malik A. Rumi
  • 1,855
  • 4
  • 25
  • 36
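
Pipelines are plain classes, so they can be unit-tested by calling process_item directly, and a saved page can be wrapped in an offline response for spider tests. A sketch under those assumptions (MyPipeline and the file path are hypothetical):

    import unittest

    from scrapy.http import HtmlResponse, Request

    from myproject.pipelines import MyPipeline  # hypothetical pipeline

    class PipelineTest(unittest.TestCase):
        def test_process_item(self):
            pipeline = MyPipeline()
            # Most pipelines only need an item and something spider-shaped.
            result = pipeline.process_item({'name': 'example'}, spider=None)
            self.assertEqual(result['name'], 'example')

    def fake_response_from_file(path, url='http://example.com'):
        # Build an HtmlResponse from a local copy of the target page so the
        # spider's parse() can be exercised without hitting the network.
        with open(path, 'rb') as f:
            body = f.read()
        return HtmlResponse(url=url, request=Request(url=url),
                            body=body, encoding='utf-8')
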
1
vote
1 answer

Invoke a custom Scrapy exporter from the command line

While trying to resolve my problem (outputting a JSON array ordered by a specific item field), I received an answer suggesting that I create a custom exporter for the job. I'm creating one, but... all the examples that I've found suggest calling…
Lore
  • 1,286
  • 1
  • 22
  • 57
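
Rather than calling the exporter manually, it can be registered under a feed format name via the FEED_EXPORTERS setting and then selected on the command line. A sketch (the exporter class path and format name are hypothetical):

    # settings.py
    FEED_EXPORTERS = {
        'ordered_json': 'myproject.exporters.OrderedJsonExporter',
    }

With that in place, older Scrapy versions select it with scrapy crawl myspider -o items.json -t ordered_json, and newer ones with scrapy crawl myspider -o items.json:ordered_json.
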
1
vote
1 answer

How to fetch data using Scrapy?

I am working on a Django project and I want to provide some news feeds on the home page. I recently started working with Scrapy; when I run the given code in "scrapy shell", it fetches the data successfully. But when I put this code into…
jax
  • 3,927
  • 7
  • 41
  • 70
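
The usual reason code works in scrapy shell but not inside another application is that a spider needs Scrapy's crawl engine around it. A minimal sketch of running a spider from Django-side code with CrawlerProcess (NewsSpider and the module path are hypothetical):

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    from myproject.spiders.news import NewsSpider  # hypothetical spider

    def run_crawl():
        # CrawlerProcess starts Twisted's reactor, which can run only once
        # per process; in a long-lived Django server, run this in a separate
        # process (management command, cron job, or task queue) instead.
        process = CrawlerProcess(get_project_settings())
        process.crawl(NewsSpider)
        process.start()  # blocks until the crawl finishes
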
1
vote
0 answers

Using Scrapy JsonItemsLinesExporter returns no value

I have multiple spiders within one scraping program. I am trying to run all the spiders simultaneously from a script and then dump the contents to a JSON file. When I run each individual spider from the shell with -o xyz.json, it works fine. I've…
Artie
  • 82
  • 1
  • 8
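
The script equivalent of -o xyz.jl is a feed configured in the settings handed to CrawlerProcess. A sketch, assuming Scrapy 2.1+ FEEDS syntax and hypothetical spider classes:

    from scrapy.crawler import CrawlerProcess

    from myproject.spiders import SpiderOne, SpiderTwo  # hypothetical spiders

    process = CrawlerProcess(settings={
        # 'jsonlines' writes one JSON object per line, like the CLI's .jl output.
        'FEEDS': {'xyz.jl': {'format': 'jsonlines'}},
    })
    process.crawl(SpiderOne)
    process.crawl(SpiderTwo)
    process.start()  # runs both spiders, then returns
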
1
vote
2 answers

Access instance of Scrapy pipeline class

I want to access the variable self.cursor to make use of the active PostgreSQL connection, but I am unable to figure out how to access Scrapy's instance of the pipeline class. class ScrapenewsPipeline(object): def open_spider(self, spider): …
atb00ker
  • 957
  • 13
  • 24
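
Scrapy does not expose pipeline instances through an official API, but a common workaround is to have the pipeline hand a reference to itself to the spider in open_spider. A hedged sketch (the attribute name db_pipeline is hypothetical):

    class ScrapenewsPipeline(object):
        def open_spider(self, spider):
            self.cursor = ...  # existing PostgreSQL cursor setup
            # Expose this pipeline instance so spider code can reach the
            # live connection as self.db_pipeline.cursor.
            spider.db_pipeline = self

        def process_item(self, item, spider):
            return item
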
1
vote
0 answers

Scrapy: manipulate the same item multiple times in different functions before yielding

I have a spider that scrapes data from a webpage and writes the title, text, and image URL to MongoDB. I have two functions: def parse_news(self, response): item = NewsItem() item['_id'] = .. #key for MongoDB - Unique item['Title'] = .. …
endritius
  • 11
  • 3
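
A common pattern for this is to pass the half-built item from callback to callback and yield it only once, at the end of the chain. A sketch using cb_kwargs (the follow-up URL and selectors are hypothetical):

    import scrapy

    from myproject.items import NewsItem  # hypothetical item class

    class NewsSpider(scrapy.Spider):
        name = 'news'

        def parse_news(self, response):
            item = NewsItem()
            item['Title'] = response.css('h1::text').get()
            # Hand the unfinished item to the next callback instead of
            # yielding it now.
            yield scrapy.Request(response.urljoin('/image-page'),
                                 callback=self.parse_image,
                                 cb_kwargs={'item': item})

        def parse_image(self, response, item):
            item['Image'] = response.css('img::attr(src)').get()
            yield item  # only the last callback yields, so MongoDB sees one document
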
1
vote
1 answer

How can I determine whether Scrapy encountered errors, in the Pipeline.close_spider() method?

I have a Scrapy spider and Pipeline setup. My Spider extracts data from a website and my Pipeline's process_item() method inserts the extracted data into a temporary database table. At the end, in the Pipeline's close_spider() method, I run some error…
buzz
  • 326
  • 3
  • 7
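
One way to detect this (a sketch, not the only option) is to keep a handle on the crawler's stats collector via from_crawler and check the error counter when the spider closes; the log_count/ERROR stat is incremented for every ERROR-level log record:

    class ErrorAwarePipeline:
        @classmethod
        def from_crawler(cls, crawler):
            pipeline = cls()
            pipeline.stats = crawler.stats  # stats collector for this crawl
            return pipeline

        def process_item(self, item, spider):
            return item

        def close_spider(self, spider):
            errors = self.stats.get_value('log_count/ERROR', 0)
            if errors:
                # Hypothetical handling: skip the final step when errors occurred.
                spider.logger.warning('%d errors during crawl; skipping merge', errors)
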
1
vote
3 answers

Unable to pass empty URL through Scrapy pipeline

I have a list of data objects, each of them containing a URL to be scraped. Some of these URLs are not valid, but I still want the data object to fall through and reach the item pipelines. After @tomáš-linhart's reply, I understood that using a middleware…
comiventor
  • 3,922
  • 5
  • 50
  • 77
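
Since only items that a callback yields enter the pipelines, one sketch of a workaround is to branch in the spider itself: request the valid URLs and yield the bare data object directly when the URL is empty (the listing endpoint and object shape are hypothetical; response.json() needs Scrapy 2.2+):

    import scrapy

    class ObjectsSpider(scrapy.Spider):
        name = 'objects'
        start_urls = ['http://example.com/objects']  # hypothetical endpoint

        def parse(self, response):
            for obj in response.json():  # hypothetical list of data objects
                url = obj.get('url')
                if url:
                    yield scrapy.Request(url, callback=self.parse_obj,
                                         cb_kwargs={'obj': obj})
                else:
                    # No request to make: yield the object as an item so it
                    # still flows through the item pipelines.
                    yield obj

        def parse_obj(self, response, obj):
            yield {**obj, 'status': response.status}
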
1
vote
1 answer

Scrapy - condition-based crawling

I have the following Scrapy parse method: def parse(self, response): item_loader = ItemLoader(item=MyItem(), response=response) for url in response.xpath('//img/@src').extract(): item_loader.add_value('image_urls',…
Akustik
  • 43
  • 8
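
A sketch of one way to make the crawl conditional: yield the loaded item when the page satisfies the condition, and follow a link otherwise (the next-page selector is hypothetical):

    import scrapy
    from scrapy.loader import ItemLoader

    from myproject.items import MyItem  # hypothetical item class

    class ConditionalSpider(scrapy.Spider):
        name = 'conditional'

        def parse(self, response):
            image_urls = response.xpath('//img/@src').getall()
            if image_urls:
                loader = ItemLoader(item=MyItem(), response=response)
                loader.add_value('image_urls', image_urls)
                yield loader.load_item()
            else:
                # Condition not met: crawl onward instead of yielding an item.
                next_page = response.xpath('//a[@rel="next"]/@href').get()
                if next_page:
                    yield response.follow(next_page, callback=self.parse)
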