
After years of reluctantly coding scrapers as a mish-mash of regexps, BeautifulSoup, and the like, I found Scrapy, which I pretty much count as this year's Christmas present to myself! It is natural to use, and it seems to have been built to make practically everything elegant and reusable.

But I am in a situation I am not sure how to tackle: my spider crawls and scrapes a listing page A, from which I generate a set of items. But for each item, I need to fetch a distinct complementary link (constructed from some of the scraped information, but not explicitly a link on the page which Scrapy could follow) to obtain additional information.

My question is in two parts: what is the protocol for fetching a URL outside of the crawling process? And how do I build items from several sources in an elegant way?

This has partially been asked (and answered) in a previous question on StackOverflow, but I am more interested in what the philosophy of Scrapy is supposed to be in this use case; surely it is not an unforeseen possibility? I wonder whether this is one of the things Pipelines are intended for (adding information from a secondary source deduced from the primary information is an instance of "post-processing"), but what is the best way to do it without completely messing up Scrapy's efficient asynchronous organization?

  • What do you mean by "to fetch a URL outside of the crawling process"? – warvariuc Aug 04 '12 at 17:57
  • @warvariuc: crawling typically starts from a set of starting URLs, fetches those pages, and adds the links it finds to the queue. By "outside of the crawling process" I mean fetching a link that is neither in the starting set nor given as an anchor in the pages that are fetched; I mean a URL that is algorithmically deduced from the scraped information. – Jérémie Aug 04 '12 at 18:03
  • @warvariuc: thank you, that's perfect! I have another question: what is the best practice when scraping multiple sites that cannot be parsed the same way (say, Amazon's listings and Walmart's listings) but whose items I want to aggregate? Should I write several spiders? Is it possible, within the same spider, to have several callbacks depending on the domain? – Jérémie Aug 05 '12 at 14:20
  • I usually write different spiders, as the parse-method logic is different; it is clearer that way. Also, different sites might need different per-spider settings, like the download delay. If you have common functionality, make your own base spider with the common methods (a minimal sketch of that pattern follows these comments). – warvariuc Aug 05 '12 at 14:21
  • You are welcome! I've deleted some of the comments and put them in the answer. – warvariuc Aug 05 '12 at 14:35
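
A minimal sketch of the base-spider pattern mentioned in the comment above. Everything here is hypothetical (spider names, selectors, settings); it only illustrates sharing common methods across site-specific spiders:

```python
import scrapy


class BaseListingSpider(scrapy.Spider):
    """Shared helpers for the site-specific listing spiders (hypothetical)."""

    def make_item(self, title, price):
        # Common normalization reused by every site-specific spider
        return {"title": (title or "").strip(), "price": (price or "").strip()}


class AmazonListingSpider(BaseListingSpider):
    name = "amazon_listings"
    download_delay = 2  # per-spider setting
    # start_urls / allowed_domains omitted in this sketch

    def parse(self, response):
        for row in response.css("div.result"):  # hypothetical selector
            yield self.make_item(row.css("h2::text").get(),
                                 row.css(".price::text").get())


class WalmartListingSpider(BaseListingSpider):
    name = "walmart_listings"
    download_delay = 5
    # same idea, different selectors and settings

    def parse(self, response):
        for row in response.css("li.product"):  # hypothetical selector
            yield self.make_item(row.css("a.title::text").get(),
                                 row.css("span.price::text").get())
```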

1 Answer


what is the protocol for fetching a URL outside of the crawling process?

When you create a Request and give it a URL, it doesn't matter where that URL came from: you can extract it from the page, or construct it some other way.
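
For example, a minimal sketch of a spider's `parse` callback yielding a Request whose URL is constructed from scraped data rather than taken from a link on the page (selectors and URL pattern are hypothetical, and current Scrapy selector APIs are assumed):

```python
import scrapy

# Fragment of a spider's parse() callback; selectors and URL pattern are hypothetical.
def parse(self, response):
    for listing in response.css("div.listing"):
        item_id = listing.css("::attr(data-id)").get()
        # The URL is built algorithmically from scraped data, not followed from the page
        url = "http://example.com/details/%s" % item_id
        yield scrapy.Request(url, callback=self.parse_details)
```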

how do I build items from several sources in an elegant way?

Use Request.meta: attach the partially built item to the request for the complementary page, and finish filling it in from response.meta in that request's callback.
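
A minimal sketch of the whole chain, assuming current Scrapy APIs and a hypothetical site structure (the example.com URLs, CSS selectors, and field names are made up):

```python
import scrapy


class ListingSpider(scrapy.Spider):
    name = "listings"
    start_urls = ["http://example.com/listing"]  # hypothetical listing page A

    def parse(self, response):
        # First source: the listing page
        for row in response.css("div.listing"):
            item = {
                "title": row.css("h2::text").get(),
                "sku": row.css("::attr(data-sku)").get(),
            }
            # Complementary URL constructed from scraped data, not a link on the page
            details_url = "http://example.com/details/%s" % item["sku"]
            # Carry the partially built item along with the request
            yield scrapy.Request(details_url,
                                 callback=self.parse_details,
                                 meta={"item": item})

    def parse_details(self, response):
        # Second source: the complementary page; complete the item and yield it
        item = response.meta["item"]
        item["price"] = response.css("span.price::text").get()
        yield item
```

Chaining requests this way keeps the extra fetches inside Scrapy's asynchronous scheduler, so there is no need to download the complementary page from a pipeline.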

  • One last question, assuming I have multiple spiders, is it possible that all items be fed into the same pipeline, and then use the [technique described in the manual](http://doc.scrapy.org/en/0.14/topics/item-pipeline.html#item-pipeline-example-with-resources-per-spider) to merge "duplicates"? – Jérémie Aug 05 '12 at 14:36
  • I don't understand; I need more info. I suggest creating a new question if it's not related to this one. – warvariuc Aug 05 '12 at 14:39