Sitemap XML parsing in Python 3.x

Question

My XML structure is below:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
    <url>
        <loc>hello world 1</loc>
        <image:image>
            <image:loc>this is image loc 1</image:loc>
            <image:title>this is image title 1</image:title>
        </image:image>
        <lastmod>2019-06-19</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.25</priority>
    </url>
    <url>
        <loc>hello world 2</loc>
        <image:image>
            <image:loc>this is image loc 2</image:loc>
            <image:title>this is image title 2</image:title>
        </image:image>
        <lastmod>2020-03-19</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.25</priority>
    </url>
</urlset>

I want to get only:

hello world 1
hello world 2

My Python code is below:

import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()

for url in root.findall('url'):
    loc = url.find('loc').text
    print(loc)

Unfortunately, it prints nothing.

But when I change my XML to:

<urlset>
    <url>
        <loc>hello world 1</loc>
        <lastmod>2019-06-19</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.25</priority>
    </url>
    <url>
        <loc>hello world 2</loc>
        <lastmod>2020-03-19</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.25</priority>
    </url>
</urlset>

it gives me the correct result:

hello world 1
hello world 2

What can I do to get the correct result without changing my XML? Modifying a file of 10,000+ lines doesn't make sense.

TIA

score 4 · Accepted Answer · answered May 25 '20 at 12:46

The (inelegant) fix to your code is:

import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()

# In find/findall, prefix namespaced tags with the full namespace in braces
for url in root.findall('{http://www.sitemaps.org/schemas/sitemap/0.9}url'):
    loc = url.find('{http://www.sitemaps.org/schemas/sitemap/0.9}loc').text
    print(loc)

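This is because you have to qualify your tag names with the namespace under which your XML is defined. The xmlns attribute on <urlset> declares a default namespace that applies to every unprefixed element, so ElementTree sees the tag as {http://www.sitemaps.org/schemas/sitemap/0.9}url rather than plain url. The details on how to use the find and findall methods with namespaces are from "Parse XML namespace with Element Tree findall".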
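
As an alternative to repeating the full URI in every call, find and findall also accept a prefix-to-URI mapping as their second argument. The following is a minimal sketch, assuming the same sitemap namespace as in the question. The prefix sm, the helper page_locs, and the inline SAMPLE string are illustrative choices for this example, not part of the original answer:

```python
import xml.etree.ElementTree as ET

# "sm" is an arbitrary prefix chosen for this sketch; only the URI has to
# match the default namespace declared on <urlset>.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Shortened stand-in for test.xml so the example runs without a file.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
    <url>
        <loc>hello world 1</loc>
        <image:image>
            <image:loc>this is image loc 1</image:loc>
        </image:image>
    </url>
    <url>
        <loc>hello world 2</loc>
    </url>
</urlset>"""


def page_locs(root):
    """Return the text of the <loc> child of every sitemap <url> element."""
    return [url.find("sm:loc", NS).text for url in root.findall("sm:url", NS)]


root = ET.fromstring(SAMPLE)
for loc in page_locs(root):
    print(loc)
```

Because find only looks at direct children and sm:loc is bound to the sitemap namespace, the <image:loc> elements are not picked up. For a file on disk, ET.parse('test.xml').getroot() could be passed to page_locs instead.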

Mirko · Answer 2 · 2020-12-25T02:55:10.093

If you don't want to deal with namespaces, this is a simpler solution than the accepted answer, and a bit more elegant, using a generic XPath query:

import lxml.etree


tree = lxml.etree.parse('test.xml')

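# Match <loc> only when it is a direct child of <url>; a bare
# //*[local-name()='loc'] would also match <image:loc>.
for url in tree.xpath("//*[local-name()='url']/*[local-name()='loc']/text()"):
    print(url)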

If you prefer to use the XML namespace, you should do it this way:

import lxml.etree


tree = lxml.etree.parse('test.xml')

namespaces = {
    'sitemapindex': 'http://www.sitemaps.org/schemas/sitemap/0.9',
}

for url in tree.xpath("//sitemapindex:loc/text()", namespaces=namespaces):
    print(url)

If you prefer to load XML data directly from memory instead of from a file, you can use lxml.etree.fromstring instead of lxml.etree.parse.
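
A minimal sketch of the in-memory variant, assuming lxml is installed and the data is already available as bytes. The names SITEMAP_BYTES, NAMESPACES, and locs_from_bytes, as well as the sm prefix, are made up for this example:

```python
import lxml.etree

# Inline stand-in for data already held in memory (for example, a sitemap
# that was downloaded rather than read from disk).
SITEMAP_BYTES = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>hello world 1</loc>
    <image:image><image:loc>this is image loc 1</image:loc></image:image>
  </url>
  <url>
    <loc>hello world 2</loc>
  </url>
</urlset>"""

# "sm" is an arbitrary prefix for this sketch; the URI is what matters.
NAMESPACES = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def locs_from_bytes(data):
    """Parse sitemap bytes and return the page <loc> values as plain strings."""
    root = lxml.etree.fromstring(data)  # returns the root element, not a tree
    hits = root.xpath("//sm:url/sm:loc/text()", namespaces=NAMESPACES)
    return [str(text) for text in hits]


for loc in locs_from_bytes(SITEMAP_BYTES):
    print(loc)
```

Passing bytes rather than str is deliberate here: lxml rejects a str that carries an encoding declaration such as <?xml version="1.0" encoding="UTF-8"?>, so bytes input is the safer default for real sitemap files.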
