
I have an XML file like the one below:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <url>
  <loc>https://ezinearticles.com/</loc>
  <changefreq>hourly</changefreq>
  <priority>1.0</priority>
 </url>
 <url>
  <loc>https://ezinearticles.com/submit/</loc>
  <changefreq>weekly</changefreq>
  <priority>0.3</priority>
 </url>
 ...................

I want to use XPath in the lxml module to get the URL from every loc tag. I implemented it with the code below, but it doesn't work: the result is an empty list.

from lxml import etree
parser = etree.XMLParser(ns_clean=True)
xmlfile = "sitemap1.xml"
xmlobj = etree.parse(xmlfile, parser)

loc = xmlobj.xpath('//loc[text()]')

print(loc)

Can anyone help me fix my script?

Le Truong Sinh

1 Answer

# define a namespace map
nsmap={'s': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

# use it in your query
loc = xmlobj.xpath('//s:loc[text()]', namespaces=nsmap)

In your original code, you were looking for a loc (in the default namespace), but the element is actually a {http://www.sitemaps.org/schemas/sitemap/0.9}loc (because the xmlns= means that everything below it uses that namespace by default), which is why the original query didn't match.
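To see the difference side by side, here is a minimal sketch that parses an inline copy of the sample from the question instead of sitemap1.xml. The closing </urlset> tag is an assumption added so the truncated fragment parses, and the names SITEMAP, unprefixed, prefixed, urls and tags are only illustrative.

```python
from lxml import etree

# Inline copy of the fragment from the question. The closing </urlset>
# is an assumption: the original listing is truncated.
SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <url>
  <loc>https://ezinearticles.com/</loc>
  <changefreq>hourly</changefreq>
  <priority>1.0</priority>
 </url>
 <url>
  <loc>https://ezinearticles.com/submit/</loc>
  <changefreq>weekly</changefreq>
  <priority>0.3</priority>
 </url>
</urlset>
"""

root = etree.fromstring(SITEMAP)
nsmap = {"s": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# An unprefixed name in XPath 1.0 means "no namespace", so nothing matches.
unprefixed = root.xpath("//loc[text()]")

# With the prefix bound to the sitemap namespace, both elements match.
prefixed = root.xpath("//s:loc[text()]", namespaces=nsmap)

urls = [el.text for el in prefixed]
tags = [el.tag for el in prefixed]  # lxml reports tags as {namespace}localname

print(unprefixed)
print(urls)
print(tags)
```

The prefix s is arbitrary; only the namespace URI it is bound to has to match the xmlns value in the file.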

Charles Duffy
  • I tried to get the loc entries with priority = 1 using this code: loc = xmlobj.xpath('//s:url[priority=1]/loc/text()', namespaces=nsmap), but I get an empty result. Do you know why? – Le Truong Sinh Jul 06 '16 at 16:08
  • `//s:url[s:priority=1]/s:loc/text()`, assuming that everything but the namespaces is right. – Charles Duffy Jul 06 '16 at 16:41
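
The two queries from the comments can be compared with a short sketch like the following. It again assumes an inline, closed-off copy of the sample data rather than the real sitemap1.xml, and the names SITEMAP, partial and full are hypothetical.

```python
from lxml import etree

# Inline copy of the fragment from the question. The closing </urlset>
# is an assumption: the original listing is truncated.
SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <url>
  <loc>https://ezinearticles.com/</loc>
  <changefreq>hourly</changefreq>
  <priority>1.0</priority>
 </url>
 <url>
  <loc>https://ezinearticles.com/submit/</loc>
  <changefreq>weekly</changefreq>
  <priority>0.3</priority>
 </url>
</urlset>
"""

root = etree.fromstring(SITEMAP)
nsmap = {"s": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Only the url step is prefixed here: priority and loc are looked up in
# no namespace, so the predicate never holds and nothing is selected.
partial = root.xpath("//s:url[priority=1]/loc/text()", namespaces=nsmap)

# Every element name in the path carries the prefix. The comparison
# s:priority=1 is numeric, so the text "1.0" counts as equal to 1.
full = root.xpath("//s:url[s:priority=1]/s:loc/text()", namespaces=nsmap)

print(partial)
print(full)
```

Each step of the path and each name inside a predicate needs its own prefix; the prefix on the first step does not carry over to the rest.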