Questions tagged [scrapy-pipeline]

218 questions
1 vote • 2 answers

Scrapy Pipeline to Parse

I made a pipeline to put Scrapy data into my Parse backend: PARSE = 'api.parse.com' PORT = 443 However, I can't find the right way to post the data to Parse, because every time it creates undefined objects in my Parse DB. class…
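A common cause of "undefined" objects is posting the Item's repr instead of a JSON body. A sketch of a pipeline that serializes the item and POSTs it to the Parse REST API (the `/1/classes/<name>` path and headers follow the old Parse REST docs; the class name and credentials are placeholders):

```python
import http.client
import json

class ParsePipeline:
    """Sketch: POST each scraped item to a Parse class over the REST API.
    PARSE_CLASS and the credentials are placeholders, not from the question."""
    PARSE_HOST = "api.parse.com"
    PARSE_PORT = 443
    PARSE_CLASS = "ScrapedItem"  # hypothetical Parse class name

    def __init__(self, app_id, api_key):
        self.app_id = app_id
        self.api_key = api_key

    def build_request(self, item):
        # Send plain JSON, not the Item object itself, or Parse stores junk.
        body = json.dumps(dict(item))
        headers = {
            "X-Parse-Application-Id": self.app_id,
            "X-Parse-REST-API-Key": self.api_key,
            "Content-Type": "application/json",
        }
        return "/1/classes/%s" % self.PARSE_CLASS, headers, body

    def process_item(self, item, spider):
        path, headers, body = self.build_request(item)
        conn = http.client.HTTPSConnection(self.PARSE_HOST, self.PARSE_PORT)
        conn.request("POST", path, body, headers)
        conn.getresponse().read()  # a 201 response carries the new objectId
        conn.close()
        return item
```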
1 vote • 1 answer

Scrapy spider that gets two pictures from the same page and names them differently

I'm new to both Python and Scrapy, so I'm not sure I've chosen the best method for doing this; my aim is to get two (or more) different pictures from a page and name them differently. How should I set up the pipeline? Should I do a…
brrrglund • 51 • 1 • 8
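One way (a sketch; the helper name, naming scheme, and meta keys are mine, not from the question) is to subclass `ImagesPipeline`, attach the item's name and a per-image index to each `Request` in `get_media_requests`, and have the `file_path` override build a distinct name with a helper like:

```python
import os
from urllib.parse import urlparse

def image_file_path(request_url, item_name, index):
    """Build a distinct filename per image on the page. Intended to be
    called from a custom ImagesPipeline.file_path() override (not shown),
    with item_name and index carried on Request.meta."""
    ext = os.path.splitext(urlparse(request_url).path)[1] or ".jpg"
    return "full/%s_%d%s" % (item_name, index, ext)
```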
1 vote • 0 answers

Scrapy - Invoke a new crawling process when a crawler finishes

I search for URLs - xxx.com/a, xxx.com/b, etc. - as found from two start_urls, xxx.com/LISTA and xxx.com/LISTB. Once this crawler has finished, I also want to crawl the pages xxx.com/x_in_database and xxx.com/y_in_database, whose URLs were…
dowjones123 • 3,695 • 5 • 40 • 83
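One low-tech option (a sketch; the item field and file name are made up): have a pipeline collect the follow-up URLs during the first crawl and dump them when the spider closes, so a second spider or a second run can read them as its start URLs. To chain the two runs inside one script, the Scrapy docs' "running multiple spiders sequentially" pattern with `CrawlerRunner` and chained deferreds applies.

```python
class CollectFollowupUrlsPipeline:
    """Sketch: accumulate URLs discovered during the first crawl and dump
    them on close, so a follow-up crawl can use the file as start_urls.
    'followup_url' is a hypothetical item field."""

    def open_spider(self, spider):
        self.urls = set()

    def process_item(self, item, spider):
        url = item.get("followup_url")
        if url:
            self.urls.add(url)
        return item

    def close_spider(self, spider):
        with open("followup_urls.txt", "w") as f:
            f.write("\n".join(sorted(self.urls)))
```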
1 vote • 1 answer

Pipeline for item not JSON serializable

I am trying to write the output of a scraped XML to JSON. The scrape fails due to an item not being serializable. The SO question "scrapy serializer" advises building a pipeline, but gives no answer, as that was out of scope for the question. So…
sayth • 6,696 • 12 • 58 • 100
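The usual fix is to give `json.dumps` a fallback encoder for the offending types (dates and sets are typical culprits). A minimal pipeline sketch along those lines, with an assumed output file name:

```python
import datetime
import json

class JsonWriterPipeline:
    """Sketch: write items as JSON lines, converting non-JSON-serializable
    values via a fallback encoder instead of letting json.dumps raise."""

    def open_spider(self, spider):
        self.file = open("items.jl", "w")

    def close_spider(self, spider):
        self.file.close()

    @staticmethod
    def _default(obj):
        # Called by json.dumps only for values it cannot serialize itself.
        if isinstance(obj, (datetime.date, datetime.datetime)):
            return obj.isoformat()
        if isinstance(obj, set):
            return sorted(obj)
        return str(obj)  # last resort: stringify anything else

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), default=self._default) + "\n")
        return item
```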
1 vote • 1 answer

Cannot download images from website with scrapy

I'm starting with Scrapy in order to automate file downloading from websites. As a test, I want to download the jpg files from this website. My code is based on the intro tutorial and the Files and Images Pipeline tutorial in the Scrapy…
luchonacho • 6,759 • 4 • 35 • 52
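The minimal wiring for the built-in images pipeline, per the Scrapy docs, is the two settings below (the folder name is an assumption); the spider's items must then carry an `image_urls` list, and download results land in an `images` field:

```python
# settings.py -- enable the built-in images pipeline
ITEM_PIPELINES = {"scrapy.pipelines.images.ImagesPipeline": 1}
IMAGES_STORE = "downloaded_images"  # local target folder (assumption)
```

Note that `ImagesPipeline` also requires Pillow to be installed; without it the pipeline silently stays disabled and nothing is downloaded.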
0 votes • 2 answers

How to merge results of nested scrapy requests into a single item?

I have a URL that lists a bunch of universities. For every university, there is a link to a list of scholarships provided by that university. Inside this link (which contains a list of scholarships), there is a link to detailed information on…
0 votes • 1 answer

Google BigQuery Update is 70x slower than Insert. How to fix?

I'm using BigQuery as my DB with a Scrapy spider. Below are 2 pipelines to store data into the DB. One uses the Insert method, the other Update. The Update method is 70 times slower than Insert (merely 20 updated records per minute). Update takes 3.560 seconds…
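Each BigQuery DML UPDATE runs as its own job, so per-record updates pay that job overhead every time; the usual remedy is to buffer rows in the pipeline and issue one batched statement (often a single MERGE) per batch. A buffering skeleton with the actual BigQuery call left as a stub:

```python
class BatchedUpdatePipeline:
    """Sketch: buffer items and flush them in batches instead of issuing
    one BigQuery UPDATE job per record. _flush() is a stub -- wire it to
    your BigQuery client with one MERGE/DML job covering all rows."""
    BATCH_SIZE = 500

    def open_spider(self, spider):
        self.buffer = []
        self.flushed = []  # stands in for executed BigQuery jobs here

    def process_item(self, item, spider):
        self.buffer.append(dict(item))
        if len(self.buffer) >= self.BATCH_SIZE:
            self._flush()
        return item

    def close_spider(self, spider):
        if self.buffer:
            self._flush()  # don't lose the final partial batch

    def _flush(self):
        rows, self.buffer = self.buffer, []
        # real code: submit ONE job (e.g. a MERGE) covering all `rows`
        self.flushed.append(rows)
```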
0 votes • 1 answer

Scrapy: passing instance variables between pipelines

Does passing spider instance variables between pipelines work? Unfortunately I do not have the code, but I'll try to explain as briefly and clearly as possible. The order is the following: Pipeline_1: high priority (@700) Pipeline_2: low priority (@900) In…
The Doctor • 17 • 5
0 votes • 0 answers

Scrapy signals not connecting to class methods

I've defined a Crawler class for crawling multiple spiders from a script. For the spiders, instead of using pipelines, I defined a class, CrawlerPipeline, and used signals for connecting methods. In CrawlerPipeline, some methods need to use class…
rish_hyun • 451 • 1 • 7 • 13
0 votes • 1 answer

Scrapy item enriching from multiple websites

I implemented the following scenario with the Python Scrapy framework: class MyCustomSpider(scrapy.Spider): def __init__(self, name=None, **kwargs): super().__init__(name, **kwargs) self.days = getattr(self, 'days', None) def…
Gandalf • 155 • 1 • 12
0 votes • 1 answer

Scrapy item import error: No module found

I'm new to Scrapy and to Python, and when I try to import a class from items.py in VS Code I get the following error: Exception has occurred: ModuleNotFoundError No module named 'scraper.items'; 'scraper' is not a package My folder structure: Folder…
Ego0r • 9 • 1
0 votes • 1 answer

List elements retrieved by XPath in Scrapy do not output correctly item by item (for, yield)

I am outputting the URL of the first page of an exhibitor's order-results pages, extracted from a specific EC site, to a CSV file, reading it back in start_requests, and looping through it with a for statement. Each order results page contains…
K_MM • 35 • 5
0 votes • 1 answer

How to save Scrapy Broad Crawl Results?

Scrapy has a built-in way of persisting results to AWS S3 using the FEEDS setting, but for a broad crawl over different domains this would create a single file in which the results from all domains are saved. How could I save the results of each…
NightOwl • 1,069 • 3 • 13 • 23
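As far as I know the FEEDS URI templating covers the spider name and time, not the domain of each result, so a common workaround is a small custom pipeline that keeps one output file per domain (the Scrapy docs show the same idea with one item exporter per key). A sketch, assuming each item carries a `url` field:

```python
import json
from urllib.parse import urlparse

class PerDomainExportPipeline:
    """Sketch: route each item to its own JSON-lines file, keyed by the
    domain of the item's (assumed) 'url' field."""

    def open_spider(self, spider):
        self.files = {}

    def process_item(self, item, spider):
        domain = urlparse(item["url"]).netloc
        if domain not in self.files:  # lazily open one file per domain
            self.files[domain] = open("%s.jl" % domain, "w")
        self.files[domain].write(json.dumps(dict(item)) + "\n")
        return item

    def close_spider(self, spider):
        for f in self.files.values():
            f.close()
```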
0 votes • 1 answer

How to yield item from RFPDupeFilter or CustomFilter

I'm using Scrapy to crawl pages from different websites. With every scrapy.Request() I set some metadata which is used to yield an item. It's also possible that my code yields multiple scrapy.Request() for the same URL, however with different…
Kiran Kyle • 99 • 11