I am trying to scrape job data from this XML feed.
The problem is that when I launch my spider, I get a valid 200 HTTP response for the start_url, but no data is scraped; the log reports 0 pages crawled and 0 items scraped.
The node I am trying to iterate over is e:Entities, which contains the e:Entity nodes that hold the job data.
I really cannot tell what I am doing wrong; I followed the Scrapy guide on XMLFeedSpider to a tee.
I suspect it could have something to do with the XML being badly organized, and perhaps with the numerous namespaces declared in the document. Is there a problem with my namespaces?
I am almost positive I chose the correct itertag value as well as the XPath selector in parse_node.
Here is my XMLFeedSpider code.
class Schneider_XML_Spider(XMLFeedSpider):
    name = "Schneider"
    namespaces = [
        ('soap', 'http://schemas.xmlsoap.org/soap/envelope/'),
        ('ns1', 'http://www.taleo.com/ws/tee800/2009/01/find'),
        ('root', 'http://www.taleo.com/ws/tee800/2009/01/find'),
        ('e', 'http://www.taleo.com/ws/tee800/2009/01'),
    ]
    allowed_domains = ['http://schneiderjobs.com']
    start_urls = ['http://schneiderjobs.com/driving-requisitions/scrape/1']
    iterator = 'iternodes'    # use an XML iterator, called iternodes
    itertag = 'e:Entities'    # loop over the "e:Entities" node's "e:Entity" children

    # parse_node gets called on every node matching itertag
    def parse_node(self, response, node):
        print "we are now scraping"
        item = Schneider_XML_Spider.itemFile.XMLScrapyPrototypeItem()
        item['rid'] = node.xpath('e:Entity/e:ContestNumber').extract()
        print item['rid']
        return item
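To sanity-check the namespace map and the parse_node XPath outside of Scrapy, I also ran this standalone snippet. Note the assumptions: the sample XML below is my guess at the feed's shape (I don't have the exact element layout in front of me), and it uses the stdlib ElementTree instead of Scrapy's selectors:

```python
import xml.etree.ElementTree as ET

# Hand-made sample mimicking the structure I *assume* the feed has;
# the element layout here is my guess, not the real payload.
SAMPLE = """\
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <Entities xmlns="http://www.taleo.com/ws/tee800/2009/01">
      <Entity>
        <ContestNumber>12345</ContestNumber>
      </Entity>
    </Entities>
  </soap:Body>
</soap:Envelope>
"""

# Same prefix -> URI mapping I registered on the spider
NSMAP = {
    'soap': 'http://schemas.xmlsoap.org/soap/envelope/',
    'e': 'http://www.taleo.com/ws/tee800/2009/01',
}

root = ET.fromstring(SAMPLE)

# Locate the container node my itertag points at
entities = root.findall('.//e:Entities', namespaces=NSMAP)

# Apply the same relative path I use in parse_node
numbers = [el.text
           for el in entities[0].findall('e:Entity/e:ContestNumber',
                                         namespaces=NSMAP)]
print(numbers)
```

Against this hand-made sample the lookup succeeds, so the prefix-to-URI mapping itself seems sound; that makes me wonder whether the problem is in how the spider's iterator handles the namespaces rather than in the XPath.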
And here is my Execution log:
(note: I put some space around what I feel is the important part of the execution log.)
C:\PythonFiles\spidersClean\app\spiders\Scrapy\xmlScrapyPrototype\1.0\spiders>scrapy crawl Schneider
2015-02-18 10:31:46-0500 [scrapy] INFO: Scrapy 0.24.4 started (bot: schneiderXml)
2015-02-18 10:31:46-0500 [scrapy] INFO: Optional features available: ssl, http11
2015-02-18 10:31:46-0500 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'spiders', 'SPIDER_MODULES': ['spiders'], 'BOT_NAME': 'schneiderXml'}
2015-02-18 10:31:46-0500 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-02-18 10:31:47-0500 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-02-18 10:31:47-0500 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-02-18 10:31:47-0500 [scrapy] INFO: Enabled item pipelines:
2015-02-18 10:31:47-0500 [Schneider] INFO: Spider opened
2015-02-18 10:31:47-0500 [Schneider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-02-18 10:31:47-0500 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-02-18 10:31:47-0500 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-02-18 10:31:52-0500 [Schneider] DEBUG: Crawled (200) <GET http://schneiderjobs.com/driving-requisitions/scrape/1> (referer: None)
2015-02-18 10:31:52-0500 [Schneider] INFO: Closing spider (finished)
2015-02-18 10:31:52-0500 [Schneider] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 245,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 1360566,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 2, 18, 15, 31, 52, 126000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 2, 18, 15, 31, 47, 89000)}
2015-02-18 10:31:52-0500 [Schneider] INFO: Spider closed (finished)
I've looked at these other SO threads on XMLFeedSpider and they have not helped:
How to scrape xml feed with xmlfeedspider
How to scrape xml urls with scrapy
Why isn't XMLFeedSpider failing to iterate through the designated nodes?
Has anyone solved a problem like this before?