
I'm working on a project to scrape statistics from Fantasy Football leagues across various services, and Yahoo is the one I'm currently stuck on. I want my spider to crawl the Draft Results page of a public Yahoo league. When I run the spider, it gives me no results, and no error message either. It simply says:

2012-09-14 17:29:08-0700 [draft] DEBUG: Crawled (200) <GET http://football.fantasysports.yahoo.com/f1/753697/draftresults?drafttab=round> (referer: None)
2012-09-14 17:29:08-0700 [draft] INFO: Closing spider (finished)
2012-09-14 17:29:08-0700 [draft] INFO: Dumping spider stats:
    {'downloader/request_bytes': 250,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 48785,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2012, 9, 15, 0, 29, 8, 734000),
     'scheduler/memory_enqueued': 1,
     'start_time': datetime.datetime(2012, 9, 15, 0, 29, 7, 718000)}
2012-09-14 17:29:08-0700 [draft] INFO: Spider closed (finished)
2012-09-14 17:29:08-0700 [scrapy] INFO: Dumping global stats:
    {}

It's not a login issue, because the page in question is accessible without being signed in. I see from other questions posted here that people have gotten scrapes to work for other parts of Yahoo. Is it possible that Yahoo Fantasy is blocking spiders? I've successfully written one for ESPN already, so I don't think the issue is with my code. Here it is anyway:

class DraftSpider(CrawlSpider):
    name = "draft"
    #psycopg stuff here

    rows = ["753697"]

    allowed_domains = ["football.fantasysports.yahoo.com"]

    start_urls = []

    for row in rows:

        start_urls.append("http://football.fantasysports.yahoo.com/f1/" + "%s" % (row) + "/draftresults?drafttab=round")

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            sites = hxs.select("/html/body/div/div/div/div/div/div/div/table/tr")
            items = []
            for site in sites:
                item = DraftItem()
                item['pick_number'] = site.select("td[@class='first']/text()").extract()
                item['pick_player'] = site.select("td[@class='player']/a/text()").extract()
                item['pick_nflteam'] = site.select("td[@class='player']/span/text()").extract()
                item['pick_ffteam'] = site.select("td[@class='last']/@title").extract()
                items.append(item)
            return items

Would really appreciate any insight on this.

ckz
  • 1. Override `start_requests` instead of filling `start_urls`. 2. Debug your code: put some prints in to follow the logic. Does it get to the `parse` method? Does the XPath query work? – warvariuc Sep 15 '12 at 16:56
  • 1. Try `scrapy shell ` to check whether the XPath selector works. 2. Using `CrawlSpider` with a custom `parse` method doesn't make sense, since `CrawlSpider` has its own `parse` definition; `BaseSpider` would be a better fit. 3. It's probably only the indentation, but it looks like you are defining the `parse` method inside the `for` loop, overwriting it on every iteration. – Daniel Werner Sep 15 '12 at 17:30
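The pitfall flagged in the second comment can be demonstrated without Scrapy at all. A class body executes top to bottom at class-definition time, so a `def` nested inside a class-level `for` loop is simply rebound on each iteration, and only the last definition survives. The sketch below is illustrative only; `Sketch`, the second league id, and the `tag` parameter are made up and not part of the original spider:

```python
# Minimal illustration of defining a method inside a class-level for loop.
class Sketch:
    rows = ["111111", "753697"]
    start_urls = []
    for row in rows:
        start_urls.append(
            "http://football.fantasysports.yahoo.com"
            "/f1/%s/draftresults?drafttab=round" % row)

        # Rebound on every pass; the default argument captures the current row.
        def parse(self, tag=row):
            return tag

s = Sketch()
print(Sketch.start_urls)  # both URLs were appended, so URL building "works"...
print(s.parse())          # ...but only the last parse definition remains: '753697'
```

This is why the URLs all get built even though, in effect, only one `parse` method ever exists on the class.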

1 Answer

C:\Users\Akhter Wahab>scrapy shell http://football.fantasysports.yahoo.com/f1/75
In [1]: hxs.select("/html/body/div/div/div/div/div/div/div/table/tr")
Out[1]: []

Your absolute XPath "/html/body/div/div/div/div/div/div/div/table/tr" is not right.

In any case, I would never recommend using an absolute XPath. Use a relative XPath instead. All the results are inside the

//div[@id='drafttables']

div, so if you anchor on it you can start getting results.
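The point about relative paths can be checked outside Scrapy with a minimal stdlib-only sketch. The HTML fragment below is made up to stand in for Yahoo's draft-results markup (the real page nests the table far deeper, which is exactly what makes a long absolute path so fragile):

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment standing in for the real page's markup.
SNIPPET = """
<div id="drafttables">
  <table>
    <tr>
      <td class="first">1.</td>
      <td class="player"><a>Arian Foster</a><span>(Hou - RB)</span></td>
      <td class="last" title="Team A">Team A</td>
    </tr>
  </table>
</div>
"""

root = ET.fromstring(SNIPPET)
picks = []
# Paths relative to each row survive changes in the surrounding layout,
# whereas /html/body/div/div/... breaks as soon as one wrapper div moves.
for row_el in root.findall(".//table/tr"):
    picks.append({
        "pick_number": row_el.findtext("td[@class='first']"),
        "pick_player": row_el.findtext("td[@class='player']/a"),
    })
print(picks)
```

The same relative expressions (`td[@class='first']/text()` and so on) already appear in the question's inner loop; it is only the outer absolute path that needs replacing.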

akhter wahab