
I am trying to scrape job data from this XML feed.

I am having a problem where, when I launch my spider, I get a valid 200 HTTP response for the start_url but don't scrape any data: 0 pages and 0 items are scraped.

The node I am trying to iterate over is e:Entities, which contains the e:Entity nodes that hold the job data.
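
For reference, the feed is shaped roughly like this (heavily trimmed, values elided; the exact placement of the xmlns declarations is approximate, and the root and ns1 prefixes in my spider below resolve to the same URI):

    <XMP>
      <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
        <soap:Body>
          <ns1:findPartialEntitiesResponse xmlns:ns1="http://www.taleo.com/ws/tee800/2009/01/find">
            <ns1:Entities>
              <e:Entity xmlns:e="http://www.taleo.com/ws/tee800/2009/01">
                <e:ContestNumber>...</e:ContestNumber>
                <!-- more job fields -->
              </e:Entity>
              <!-- more e:Entity nodes -->
            </ns1:Entities>
          </ns1:findPartialEntitiesResponse>
        </soap:Body>
      </soap:Envelope>
    </XMP>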

I really cannot tell what I am doing wrong. I followed the Scrapy guide on XMLFeedSpiders here to a tee.

I suspect it could have something to do with the XML being badly organized, and perhaps with the numerous namespaces in the XML. Is there a problem with my namespaces?

I am almost positive I chose the correct iternodes value as well as the parse_node XPath selector.

Here is my XMLFeedSpider code.

from scrapy.contrib.spiders import XMLFeedSpider  # Scrapy 0.24 import path

class Schneider_XML_Spider(XMLFeedSpider):
    name = "Schneider"
    namespaces = [
        ('soap', 'http://schemas.xmlsoap.org/soap/envelope/'),
        ('ns1', 'http://www.taleo.com/ws/tee800/2009/01/find'),
        ('root', 'http://www.taleo.com/ws/tee800/2009/01/find'),
        ('e', 'http://www.taleo.com/ws/tee800/2009/01')
    ]

    allowed_domains = ['http://schneiderjobs.com']
    start_urls = ['http://schneiderjobs.com/driving-requisitions/scrape/1']
    iterator = 'iternodes'  # use an XML iterator, called iternodes

    itertag = 'e:Entities'  # loop over "e:Entities" nodes

    # parse_node gets called on every node matching itertag
    def parse_node(self, response, node):
        print "we are now scraping"

        item = Schneider_XML_Spider.itemFile.XMLScrapyPrototypeItem()

        item['rid'] = node.xpath('e:Entity/e:ContestNumber').extract()
        print item['rid']
        return item

And here is my execution log:

(Note: I put some space around what I feel is the important part of the log.)

C:\PythonFiles\spidersClean\app\spiders\Scrapy\xmlScrapyPrototype\1.0\spiders>scrapy crawl Schneider
2015-02-18 10:31:46-0500 [scrapy] INFO: Scrapy 0.24.4 started (bot: schneiderXml)
2015-02-18 10:31:46-0500 [scrapy] INFO: Optional features available: ssl, http11
2015-02-18 10:31:46-0500 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'spiders', 'SPIDER_MODULES': ['spiders'], 'BOT_NAME': 'schneiderXml'}
2015-02-18 10:31:46-0500 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-02-18 10:31:47-0500 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-02-18 10:31:47-0500 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-02-18 10:31:47-0500 [scrapy] INFO: Enabled item pipelines:

2015-02-18 10:31:47-0500 [Schneider] INFO: Spider opened
2015-02-18 10:31:47-0500 [Schneider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-02-18 10:31:47-0500 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-02-18 10:31:47-0500 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-02-18 10:31:52-0500 [Schneider] DEBUG: Crawled (200) <GET http://schneiderjobs.com/driving-requisitions/scrape/1> (referer: None)
2015-02-18 10:31:52-0500 [Schneider] INFO: Closing spider (finished)
2015-02-18 10:31:52-0500 [Schneider] INFO: Dumping Scrapy stats:

        {'downloader/request_bytes': 245,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'downloader/response_bytes': 1360566,
         'downloader/response_count': 1,
         'downloader/response_status_count/200': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2015, 2, 18, 15, 31, 52, 126000),
         'log_count/DEBUG': 3,
         'log_count/INFO': 7,
         'response_received_count': 1,
         'scheduler/dequeued': 1,
         'scheduler/dequeued/memory': 1,
         'scheduler/enqueued': 1,
         'scheduler/enqueued/memory': 1,
         'start_time': datetime.datetime(2015, 2, 18, 15, 31, 47, 89000)}
2015-02-18 10:31:52-0500 [Schneider] INFO: Spider closed (finished)

I've looked at these other SO threads on XMLFeedSpiders, and they have not helped:
How to scrape xml feed with xmlfeedspider

How to scrape xml urls with scrapy

Why isn't XMLFeedSpider failing to iterate through the designated nodes?

Has anyone solved a problem like this before?


1 Answer


I've figured this one out!

I was assuming that itertag worked by applying a //node-name XPath selector to find the target loop-parent node anywhere in the document. In actuality, this is not the case.

You need to give itertag an explicit XPath from the document root down to the node you want to loop over. In my case, the following code change made my spider work:

class Schneider_XML_Spider(XMLFeedSpider):
    name = "Schneider"

    # register all the namespaces so that the XML Selector knows how to handle each of them
    namespaces = [
        ('soap', 'http://schemas.xmlsoap.org/soap/envelope/'),
        ('ns1', 'http://www.taleo.com/ws/tee800/2009/01/find'),
        ('root', 'http://www.taleo.com/ws/tee800/2009/01/find'),
        ('e', 'http://www.taleo.com/ws/tee800/2009/01')
    ]

    # 'xml' hands each matched node to parse_node as a scrapy.selector.Selector
    iterator = 'xml'

    # point to the tag that contains all the inner nodes you want to process
    itertag = "XMP/soap:Envelope/soap:Body/ns1:findPartialEntitiesResponse/root:Entities"

    allowed_domains = ['http://schneiderjobs.com']
    start_urls = ['http://schneiderjobs.com/driving-requisitions/scrape/1']

You must make sure that every namespace that appears in the XPath to your itertag is defined in your namespaces list.
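
A quick way to sanity-check the namespaces and the itertag path before running the spider is scrapy shell. A minimal sketch (the register_namespace calls mirror the namespaces list above):

    $ scrapy shell http://schneiderjobs.com/driving-requisitions/scrape/1
    >>> sel = response.selector
    >>> # register the same prefix/URI pairs the spider declares
    >>> sel.register_namespace('soap', 'http://schemas.xmlsoap.org/soap/envelope/')
    >>> sel.register_namespace('ns1', 'http://www.taleo.com/ws/tee800/2009/01/find')
    >>> sel.register_namespace('root', 'http://www.taleo.com/ws/tee800/2009/01/find')
    >>> sel.register_namespace('e', 'http://www.taleo.com/ws/tee800/2009/01')
    >>> # if this prints contest numbers, the path and prefixes are right
    >>> sel.xpath('//root:Entities/e:Entity/e:ContestNumber/text()').extract()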

Also, if you're trying to get the inner text of a node in your parse_node method, remember to add /text() to the end of your XPath. For example:

item['rid'] = node.xpath('e:Entity/e:ContestNumber/text()').extract()
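
Putting both fixes together: since itertag now points at the Entities node itself, parse_node is called once with that whole node, so one way to emit an item per job is to loop over its e:Entity children. A rough sketch (XMLScrapyPrototypeItem is the item class from my question; adjust the import to your own items module):

    def parse_node(self, response, node):
        # itertag selects the root:Entities node, so this is called once
        # with the whole node; loop over its e:Entity children instead
        for entity in node.xpath('e:Entity'):
            item = XMLScrapyPrototypeItem()
            # /text() pulls the inner text; extract() returns a list of strings
            item['rid'] = entity.xpath('e:ContestNumber/text()').extract()
            yield item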

I hope this answer is helpful to anyone who comes across this question in the future.
