
Is there a way I can skip downloading a webpage but still have the other parts of the pipeline after it execute?

Currently, I read a file of JSON objects in start_requests; each JSON object has a website URL and other data fields. If the website URL is not empty, start_requests yields a request object; otherwise it skips that record.

In another function, parse, I create the item object, after which the database pipeline comes into the picture.

I want to insert the other data fields even when the website URL is empty, in which case start_requests doesn't create a request object.
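For context, a minimal sketch of the setup described above, assuming a hypothetical line-delimited input file (input.jsonl) and placeholder field names (website, page_title); the real spider will differ:

```python
import json

import scrapy


class MySpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        # One JSON object per line is assumed; each holds a website URL
        # plus other data fields.
        with open("input.jsonl") as f:
            for line in f:
                record = json.loads(line)
                url = record.get("website", "")
                if url:
                    # Carry the remaining fields along so parse() can build the item.
                    yield scrapy.Request(url, callback=self.parse,
                                         meta={"record": record})
                # When the URL is empty, no request is yielded, so the record
                # never reaches parse() or the database pipeline.

    def parse(self, response):
        record = response.meta["record"]
        # The item built here flows into the database pipeline.
        yield {**record, "page_title": response.css("title::text").get()}
```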

comiventor
  • Hmm... not too clear on what you're working with (the json file in question), but I can't see why you can't just parse through it in parse: extract all working URLs with some regex, then loop and set callbacks accordingly? Wish I could be more help. – scriptso Aug 30 '17 at 20:02
  • parse works only when it receives a response, and a response is generated only when a request is processed. In the case where the URL is empty, no request is generated and hence nothing is processed, but I still want the data associated with the empty URL to fall through to the pipeline. I think my answer lies in using some form of DownloaderMiddleware where I override process_request and raise an IgnoreRequest error (a rough sketch along these lines appears after the comments). – comiventor Aug 30 '17 at 20:14
  • Ah, gotcha! Well, you can set a middleware to handle a particular response and have it then point to your pipeline. I can't say I've ever done something of that sort, but I know you can. – scriptso Aug 30 '17 at 20:18
  • I'm just rereading your comment and notice I'm repeating what you said, but to build on top of that: your pipeline, are you using it for logging purposes? – scriptso Aug 30 '17 at 21:38
  • Not for logging. I have multiple sources for obtaining clean information, one of them being a webpage; another can be simple parsers, etc. In the end, I want all the data to go into the pipeline. – comiventor Aug 30 '17 at 21:47
  • I hate to be a stickler... just naturally inquisitive because I like where your mind's at. So 1) on your initial call, whatever your first request is, you extract a json file; 2) in this file, for whatever reason, some URLs are just not there, or are down, ergo the middleware to help you handle the behavior when receiving a response code you set in settings. Now what throws me off: you say you also have a downloader middleware enabled? I imagine that's part of what your project needs; the Scrapy docs cover the HTTP middlewares, you should probably look that up. – scriptso Aug 31 '17 at 06:59
  • If you now have a source for when your initial json has no URL, or the pages are down or inaccessible, then you can deal with the behavior when you encounter a given response. If I may offer some food for thought: amalgamate all working links by scraping or regexing your way to them, creating a multi-spider instance. Again, I have no real idea what your project flow looks like, so much luck. – scriptso Aug 31 '17 at 07:06
  • I will have to read about DownloaderMiddleware and write one as per my requirements, that's all :) – comiventor Aug 31 '17 at 08:04
  • Maybe you are turning something simple into a very complicated task. My two cents: Scrapy is for downloading pages in an efficient fashion, right? Use Scrapy for that. Have a script split your input in two: what goes to Scrapy (data with a URL) and what doesn't (data without a URL). Move your cleaning code from the pipeline to a third script that cleans all your data and puts it in a DB. Oh, and you can also open your json file in one of the pipelines and you will have the data you want! – Djunzu Sep 05 '17 at 20:55
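Building on the middleware idea from the comments: rather than raising IgnoreRequest (which drops the request, and with it the item unless an errback re-emits the data), a downloader middleware's process_request may return a Response object, in which case Scrapy skips the download entirely but still calls the spider callback, so the item reaches the pipelines. A rough sketch under those assumptions, with placeholder module path, field names, and a dummy URL used only to flag records that should skip the download:

```python
import json

import scrapy
from scrapy.http import HtmlResponse


class SkipDownloadMiddleware:
    """Hand back a stub response for flagged requests instead of downloading."""

    def process_request(self, request, spider):
        if request.meta.get("skip_download"):
            # Returning a Response from process_request makes Scrapy skip the
            # download; parse() still runs and items still reach the pipelines.
            return HtmlResponse(url=request.url, body=b"", request=request)
        return None  # anything else is downloaded normally


class MySpider(scrapy.Spider):
    name = "example"
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            # Module path is a placeholder; point it at wherever the class lives.
            "myproject.middlewares.SkipDownloadMiddleware": 543,
        }
    }

    def start_requests(self):
        with open("input.jsonl") as f:
            for line in f:
                record = json.loads(line)
                url = record.get("website", "")
                if url:
                    yield scrapy.Request(url, meta={"record": record})
                else:
                    # Dummy URL so a Request can still be built; the middleware
                    # intercepts it before any network call is made.
                    yield scrapy.Request(
                        "http://skip.invalid/",
                        meta={"record": record, "skip_download": True},
                        dont_filter=True,
                    )

    def parse(self, response):
        record = response.meta["record"]
        item = dict(record)
        if not response.meta.get("skip_download"):
            item["page_title"] = response.css("title::text").get()
        # Either way the item goes through the database pipeline.
        yield item
```

Djunzu's alternative, splitting the input up front or reading the json file directly in one of the pipelines, avoids the middleware altogether and may be simpler if the URL-less records never need anything from Scrapy.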

0 Answers