After years of reluctantly coding scrapers as a mish-mash of regexps, BeautifulSoup, etc., I found Scrapy, which I pretty much count as this year's Christmas present to myself! It is natural to use, and it seems to have been built to make practically everything elegant and reusable.
But I am in a situation I am not sure how to tackle: my spider crawls and scrapes a listing page A, from which I generate a set of items. For each item, however, I need to fetch a distinct complementary URL (constructed from some of the scraped information, not an explicit link on the page that Scrapy could follow) to obtain additional information.
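The only approach I can picture so far is chaining requests inside the spider itself: build a partial item in the listing callback, then yield a new Request for the constructed URL and finish the item in a second callback. A rough sketch of what I mean (all URLs, selectors, and field names below are made up for illustration):

```python
import scrapy

class ListingSpider(scrapy.Spider):
    name = "listing"
    # Placeholder for listing page A
    start_urls = ["http://example.com/listing"]

    def parse(self, response):
        for row in response.css("div.item"):
            # Partial item scraped from the listing page
            item = {
                "title": row.css("h2::text").get(),
                "code": row.css("span.code::text").get(),
            }
            # Complementary URL constructed from scraped data,
            # not an actual <a> link present on the page
            detail_url = f"http://example.com/details/{item['code']}"
            yield scrapy.Request(
                detail_url,
                callback=self.parse_details,
                meta={"item": item},
            )

    def parse_details(self, response):
        # Complete the item with data from the secondary page
        item = response.meta["item"]
        item["extra"] = response.css("p.extra::text").get()
        yield item
```

But this keeps all the item-building logic in the spider, and I am not sure that is where it belongs.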
My question is in two parts: what is the protocol for fetching a URL outside of the normal crawling process? And how do I build items from several sources in an elegant way?
This has partially been asked (and answered) in a previous question on StackOverflow, but I am more interested in what Scrapy's philosophy is supposed to be in this use case, which is surely not an unforeseen possibility. I wonder if this is one of the things Pipelines are meant for (adding information from a secondary source deduced from the primary information is an instance of "post-processing"), but what is the best way to do it without completely messing up Scrapy's efficient asynchronous organization?