
My script works wonderfully when I comment one piece of code: return items.

Here is my code, with the domain changed to http://example.com, since that appears to be what other people do to sidestep any 'scraping' legality issues.

import re

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

# FoodItem is the Item subclass defined in my project's items module

class Vfood(CrawlSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/TV_Shows/Show/Episodes",
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=('example\.com', 'page='),
                               restrict_xpaths='//div[@class="paginator"]/span[@id="next"]'),
             callback='parse'),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        countries = hxs.select('//div[@class="index-content"]')
        tmpNextPage = hxs.select('//div[@class="paginator"]/span[@id="next"]/a/@href').extract()
        for country in countries:
            item = FoodItem()
            countryName = country.select('.//h3/text()').extract()
            item['country'] = countryName
            print "Country Name: ", countryName
            shows = country.select('.//div[@class="content1"]')
            for show in shows.select('.//div'):
                showLink = (show.select('.//h4/a/@href').extract()).pop()
                showLocation = show.select('.//h4/a/text()').extract()
                showText = show.select('.//p/text()').extract()
                item['showURL'] = "http://www.example.com" + str(showLink)
                item['showcity'] = showLocation
                item['showtext'] = showText
                print "\t", showLink
                print "\t", showLocation
                print "\t", showText
                print "\n"
                items.append(item)
            #return items

        for NextPageLink in tmpNextPage:
            m = re.search("Location", NextPageLink)
            if m:
                NextPage = NextPageLink
                print "Next Page:  ", NextPage
                yield Request("http://www.example.com/" + NextPage, callback=self.parse)
            else:
                NextPage = 'None'

SPIDER = Vfood()

If I UNCOMMENT the #return items line, I get the following error:

yield Request("http://www.example.com/"+NextPage, callback = self.parse)
SyntaxError: 'return' with argument inside generator

With the line left commented out, I am unable to collect the data in XML format, but judging by the output of the print statements, everything I expect does appear on screen.

My command for getting XML out:

scrapy crawl example.com --set FEED_URI=food.xml --set FEED_FORMAT=xml

The XML file does get created when I UNCOMMENT the return items line above, but then the script stops and won't follow the links.

Geo99M6Z
3 Answers


You're returning a list of items (probably in the wrong place), and later in the same function you use yield to emit requests. You can't mix yield and return like this in Python.

Either add everything to a list and return it at the end of your parse method or use yield everywhere. My suggestion is to replace items.append(item) with yield item and remove all references to the items list.
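A dependency-free sketch of the "yield everywhere" fix, using plain dicts and tuples to stand in for Scrapy's FoodItem and Request (those names and the data shapes here are illustrative, not the asker's real project):

```python
# parse() becomes a generator that yields items and follow-up "requests"
# alike, instead of building an items list and returning it.
def parse(response):
    for country in response["countries"]:
        for show in country["shows"]:
            # Previously: items.append(item) ... return items
            yield {"country": country["name"], "showURL": show}
    for link in response["next_pages"]:
        if "Location" in link:
            # In Scrapy this would be: yield Request(url, callback=self.parse)
            yield ("request", "http://www.example.com/" + link)

response = {
    "countries": [{"name": "France", "shows": ["/show/1"]}],
    "next_pages": ["Location?page=2"],
}
results = list(parse(response))  # items first, then the follow-up request
```

Because everything is yielded, the caller (Scrapy's engine, in the real spider) receives each item as it is scraped and each request as it is discovered, in one pass.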

Shane Evans
  • Awesome!! This solution worked for me. Thank you! I had to add a `return` at the end before the last line for the script to crawl properly. – Geo99M6Z Jun 30 '11 at 16:18

Does this answer your question: http://www.answermysearches.com/python-fixing-syntaxerror-return-with-argument-inside-generator/354/

This error is telling you that once you use yield inside a function, making it a generator, you can only use return with no arguments.
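A minimal example of the allowed pattern (in Python 2, which Scrapy used at the time, `return value` inside a generator is a SyntaxError; a bare `return` simply ends iteration, and this sketch runs under Python 3 as well):

```python
def take_until_stop(values):
    """Yield values until a sentinel is seen, then stop."""
    for v in values:
        if v == "STOP":
            return  # bare return: ends the generator, no value attached
        yield v

print(list(take_until_stop(["a", "b", "STOP", "c"])))  # ['a', 'b']
```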

I'd also suggest using item loaders, like this:

from scrapy.contrib.loader import XPathItemLoader

def parse(self, response):
    # Product is your Item subclass
    l = XPathItemLoader(item=Product(), response=response)
    l.add_xpath('name', '//div[@class="product_name"]')
    l.add_xpath('name', '//div[@class="product_title"]')
    l.add_xpath('price', '//p[@id="price"]')
    l.add_xpath('stock', '//p[@id="stock"]')
    l.add_value('last_updated', 'today') # you can also use literal values
    return l.load_item()
user
  • I did come across the suggested URL in my searches, but at the time I didn't understand the issue. I think everything just clicked to what `yield` does. Thank you for your reply, I will attempt to implement the item loaders as you suggest once I _evolve_ :) Thanks again for the reply. – Geo99M6Z Jun 30 '11 at 16:22

The CrawlSpider class uses the parse method internally, so you should name your own callback something else, like parse_item(). See "Crawling Rules": http://doc.scrapy.org/topics/spiders.html#scrapy.spider.BaseSpider.
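A toy model of why overriding parse() breaks link-following (the class and method names below are illustrative stand-ins, not real Scrapy APIs):

```python
class ToyCrawlSpider:
    """Stand-in for CrawlSpider: its own parse() applies the crawl rules."""
    def parse(self, response):
        # The base class relies on parse() to schedule follow-up requests.
        return ["followed:" + link for link in response["links"]]

class BadSpider(ToyCrawlSpider):
    # Overriding parse() replaces the rule-processing logic entirely,
    # so links are never followed.
    def parse(self, response):
        return ["item"]

class GoodSpider(ToyCrawlSpider):
    # Extracting items in a differently-named callback leaves the base
    # class's parse(), and hence link-following, intact.
    def parse_item(self, response):
        return ["item"]

response = {"links": ["page2"]}
print(BadSpider().parse(response))   # rules bypassed
print(GoodSpider().parse(response))  # rules still run
```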

emish