0

I am trying to scrape an XML file with the format below.

file_sample.xml:

<rss version="2.0">
 <channel>
   <item>
       <title>SENIOR BUDGET ANALYST (new)</title>
       <link>https://hr.example.org/psp/hrapp&SeqId=1</link>
       <pubDate>Wed, 18 Jul 2012 04:00:00 GMT</pubDate>
       <category>All Open Jobs</category>
   </item>
   <item>
       <title>BUDGET ANALYST (healthcare)</title>
       <link>https://hr.example.org/psp/hrapp&SeqId=2</link>
       <pubDate>Wed, 18 Jul 2012 04:00:00 GMT</pubDate>
       <category>All category</category>
   </item>
 </channel>
</rss>

Below is my spider.py code:

class TestSpider(XMLFeedSpider):
    name = "testproject"
    allowed_domains = {"www.example.com"}
    start_urls = [
        "https://www.example.com/hrapp/rss/careers_jo_rss.xml"
        ]
    iterator = 'iternodes'
    itertag = 'channel'


    def parse_node(self, response, node):
        title = node.select('item/title/text()').extract()
        link  = node.select('item/link/text()').extract()
        pubdate  = node.select('item/pubDate/text()').extract()
        category  = node.select('item/category/text()').extract()
        item = TestprojectItem()
        item['title'] = title
        item['link'] = link
        item['pubdate'] = pubdate
        item['category'] = category
        return item

Result:

2012-07-25 13:24:14+0530 [testproject] DEBUG: Scraped from <200 https://hr.templehealth.org/hrapp/rss/careers_jo_rss.xml>
    {'title': [u'SENIOR BUDGET ANALYST (hospital/healthcare)',
               u'BUDGET ANALYST'],
     'link': [u'https://hr.example.org/psp/hrapp&SeqId=1',
               u'https://hr.example.org/psp/hrapp&SeqId=2'] 
     'pubdate': [u'Wed, 18 Jul 2012 04:00:00 GMT',
               u'Wed, 18 Jul 2012 04:00:00 GMT'] 
     'category': [u'All Open Jobs',
               u'All category'] 
      }

As you can see from the result above, the values from the corresponding tags are all combined into a single list per field, but I want them grouped by their individual item tag, as below, the way we do it for HTML scraping.

    {'title': u'SENIOR BUDGET ANALYST (hospital/healthcare)',
     'link': u'https://hr.example.org/psp/hrapp&SeqId=1',
     'pubdate': u'Wed, 18 Jul 2012 04:00:00 GMT',
     'category': u'All Open Jobs'
      }
    {'title': u'BUDGET ANALYST',
     'link': u'https://hr.example.org/psp/hrapp&SeqId=2',
     'pubdate': u'Wed, 18 Jul 2012 04:00:00 GMT',
     'category': u'All category'
      }

How can we scrape XML tag data separately for each parent tag, such as the item tag above?

Thanks in advance.

Shiva Krishna Bavandla

3 Answers

4

Try changing your itertag from itertag = 'channel' to itertag = 'item'.

bzudo
2

Just change itertag = 'item'.

If you refer to the documentation of the parse_node method, it states that the method is called for the nodes matching the provided tag name (itertag). In your case that is 'item' (a child node of 'channel').

rocketkid
0

I recommend the use of feedparser:

feedparser.parse(url)

results in

{'bozo': 1,
 'bozo_exception': xml.sax._exceptions.SAXParseException("EntityRef: expecting ';'\n"),
 'encoding': u'utf-8',
 'entries': [{'link': u'https://hr.example.org/psp/hrapp&SeqId=1',
   'links': [{'href': u'https://hr.example.org/psp/hrapp&SeqId=1',
     'rel': u'alternate',
     'type': u'text/html'}],
   'tags': [{'label': None, 'scheme': None, 'term': u'All Open Jobs'}],
   'title': u'SENIOR BUDGET ANALYST (new)',
   'title_detail': {'base': u'',
    'language': None,
    'type': u'text/plain',
    'value': u'SENIOR BUDGET ANALYST (new)'},
   'updated': u'Wed, 18 Jul 2012 04:00:00 GMT',
   'updated_parsed': time.struct_time(tm_year=2012, tm_mon=7, tm_mday=18, tm_hour=4, tm_min=0, tm_sec=0, tm_wday=2, tm_yday=200, tm_isdst=0)},
  {'link': u'https://hr.example.org/psp/hrapp&SeqId=2',
   'links': [{'href': u'https://hr.example.org/psp/hrapp&SeqId=2',
     'rel': u'alternate',
     'type': u'text/html'}],
   'tags': [{'label': None, 'scheme': None, 'term': u'All category'}],
   'title': u'BUDGET ANALYST (healthcare)',
   'title_detail': {'base': u'',
    'language': None,
    'type': u'text/plain',
    'value': u'BUDGET ANALYST (healthcare)'},
   'updated': u'Wed, 18 Jul 2012 04:00:00 GMT',
   'updated_parsed': time.struct_time(tm_year=2012, tm_mon=7, tm_mday=18, tm_hour=4, tm_min=0, tm_sec=0, tm_wday=2, tm_yday=200, tm_isdst=0)}],
 'feed': {},
 'namespaces': {},
 'version': u'rss20'}
  • Wow! Thanks very much, that helped a lot. But we still need to select the respective tag values by looking them up, right? Is there any way to display only the tags and their values? – Shiva Krishna Bavandla Jul 25 '12 at 10:12
  • Do you mean something like this: `[entry.tags[0]["term"] for entry in feedparser.parse(open("/tmp/feed.rss")).entries]` => `[u'All Open Jobs', u'All category']`? –  Jul 25 '12 at 10:16
  • No, my idea is that when we run some code against an XML URL, it should automatically parse the tags and map them to their values dynamically (without knowing the tags in advance and retrieving their values manually). But the code you have given above is acceptable. – Shiva Krishna Bavandla Jul 25 '12 at 10:25
  • Are there any other libraries or modules for doing the same? – Shiva Krishna Bavandla Jul 25 '12 at 10:29
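
One possible way to get the dynamic tag-to-value mapping asked about in these comments is the standard library's `xml.etree.ElementTree`. The sketch below is only an illustration: `items_as_dicts` is a hypothetical helper name, and the sample feed is the one from the question with `&` in the links escaped as `&amp;` (an assumption, since a raw `&` is not well-formed XML, which is also what the `bozo_exception` in the feedparser output is reporting).

```python
import xml.etree.ElementTree as ET

# Sample feed from the question. The '&' in each link is written as
# '&amp;' here (an assumption) so that the document is well-formed XML.
SAMPLE = """<rss version="2.0">
 <channel>
   <item>
       <title>SENIOR BUDGET ANALYST (new)</title>
       <link>https://hr.example.org/psp/hrapp&amp;SeqId=1</link>
       <pubDate>Wed, 18 Jul 2012 04:00:00 GMT</pubDate>
       <category>All Open Jobs</category>
   </item>
   <item>
       <title>BUDGET ANALYST (healthcare)</title>
       <link>https://hr.example.org/psp/hrapp&amp;SeqId=2</link>
       <pubDate>Wed, 18 Jul 2012 04:00:00 GMT</pubDate>
       <category>All category</category>
   </item>
 </channel>
</rss>"""


def items_as_dicts(xml_text, tag="item"):
    """Hypothetical helper: one dict per <tag> element, keyed by child tag name."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: (child.text or "").strip() for child in node}
        for node in root.iter(tag)
    ]


records = items_as_dicts(SAMPLE)
for record in records:
    print(record)
```

No tag names are hard-coded apart from `item`, so any extra child elements would be picked up automatically. If a child tag repeats within one item, the later value overwrites the earlier one in this sketch.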