Given the following XML:

<?xml version="1.0" encoding="UTF-8"?>
<entry xmlns="http://www.w3.org/2005/Atom">
  <id>1</id>
  <title>Example XML</title>
  <published>2021-12-15T00:00:00Z</published>
  <updated>2022-01-06T12:44:47Z</updated>
  <content type="application/xml">
    <articleDoc xmlns="" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" schemaVersion="1.8" xml:lang="en">
      <articleDocHead>
        <itemInfo/>
      </articleDocHead>
    </articleDoc>
  </content>
</entry>

How can I get the value of the xml:lang attribute of the entry/content/articleDoc element? I've checked the Python docs, but unfortunately they don't cover attributes with namespaces. The solution I found, manually writing the namespace URI in front of the attribute name in the dictionary key, seems wrong. I'm working with Python 3.9.9.

Here's my code so far:

import xml.etree.ElementTree as tree  # cElementTree was removed in Python 3.9

xml = """<?xml version="1.0" encoding="UTF-8"?>
<entry xmlns="http://www.w3.org/2005/Atom">
  <id>1</id>
  <title>Example XML</title>
  <published>2021-12-15T00:00:00Z</published>
  <updated>2022-01-06T12:44:47Z</updated>
  <content type="application/xml">
    <articleDoc xmlns="" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" schemaVersion="1.8" xml:lang="en">
      <articleDocHead>
        <itemInfo/>
      </articleDocHead>
    </articleDoc>
  </content>
</entry>"""
ns = {'nitf': 'http://iptc.org/std/NITF/2006-10-18/',
      'w3': 'http://www.w3.org/2005/Atom',
      'xml': 'http://www.w3.org/XML/1998/namespace'}
root = tree.fromstring(xml)
id = root.find("w3:id", ns).text # works
print(id)
type_attribute = root.find("w3:content", ns).attrib['type'] # works
print(type_attribute)

#language = root.find("w3:content/articleDoc/articleDocHeader[xml:lang']", ns) # doesn't work
language = root.find("w3:content/articleDoc", ns).attrib['{http://www.w3.org/XML/1998/namespace}lang'] # works, but seems wrong
print(language)

Any help is appreciated. Thanks a lot!

Philip Koch
  • https://stackoverflow.com/a/61781919/407651 unfortunately does not answer my question since I need to extract the value from the attribute after finding the element. Or does it mean there's no better way than to hardcode the .attrib['{http://www.w3.org/XML/1998/namespace}lang'] string for each attribute? – Philip Koch Jan 07 '22 at 11:32
  • OK, see https://stackoverflow.com/a/62368982/407651. It may look a little clumsy, but you need to use `{http://www.w3.org/XML/1998/namespace}lang` (with either `get()` or `attrib`). – mzjn Jan 07 '22 at 11:42
  • With the built-in ElementTree, spelling out the canonical name of the attribute is the best you can do, since attributes are implemented as dicts on elements instead of stand-alone attribute nodes, and XPath support is only rudimentary. With lxml, you can use a complete implementation of XPath, including namespace prefixes for attributes, i.e. this would work as expected: `tree.xpath('//@xml:lang', namespaces=ns)` and give `['en']`. – Tomalak Jan 07 '22 at 12:01
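
As the comments note, the standard-library ElementTree stores attributes under their fully qualified `{namespace-uri}local-name` key, so the URI has to appear somewhere. One way to keep it out of every lookup is to define it once as a constant. A minimal sketch, assuming a trimmed copy of the question's XML; `qname` is a hypothetical helper, not part of ElementTree:

```python
import xml.etree.ElementTree as ET

# The "xml" prefix is always bound to this URI (Namespaces in XML spec).
XML_NS = "http://www.w3.org/XML/1998/namespace"
ATOM_NS = "http://www.w3.org/2005/Atom"


def qname(uri, local):
    """Hypothetical helper: build the '{uri}local' key ElementTree uses."""
    return f"{{{uri}}}{local}"


# Trimmed copy of the XML from the question (an assumption for brevity).
doc = """<entry xmlns="http://www.w3.org/2005/Atom">
  <content type="application/xml">
    <articleDoc xmlns="" schemaVersion="1.8" xml:lang="en"/>
  </content>
</entry>"""

root = ET.fromstring(doc)

# articleDoc resets the default namespace (xmlns=""), so it needs no prefix.
article = root.find("a:content/articleDoc", {"a": ATOM_NS})

# .get() returns None instead of raising KeyError if the attribute is absent.
language = article.get(qname(XML_NS, "lang"))
print(language)
```

This does not avoid the qualified name; it only keeps the URI in one place so that each attribute lookup stays short.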

1 Answer

Here is a quick guide to finding your way around an XML file using lxml.etree:

In [2]: import lxml.etree as etree

In [3]: xml = """
   ...:     <entry xmlns="http://www.w3.org/2005/Atom" xmlns:demo="http://www.whatever.com">
   ...:       <id>1</id>
   ...:       <demo:demo_child>some namespace entry</demo:demo_child>
   ...:       <title>Example XML</title>
   ...:       <published>2021-12-15T00:00:00Z</published>
   ...:       <updated>2022-01-06T12:44:47Z</updated>
   ...:       <content type="application/xml">
   ...:         <articleDoc xmlns="" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" schemaVersion="1.8" xml:lang="en">
   ...:           <articleDocHead>
   ...:             <itemInfo/>
   ...:           </articleDocHead>
   ...:         </articleDoc>
   ...:       </content>
   ...:     </entry>"""

In [4]: tree = etree.fromstring(xml)

In [5]: tree
Out[5]: <Element {http://www.w3.org/2005/Atom}entry at 0x7d010c153800>

In [6]: list(tree.iterchildren())  # get children of the current element
Out[6]: 
[<Element {http://www.w3.org/2005/Atom}id at 0x7d010c1b06c0>,
 <Element {http://www.whatever.com}demo_child at 0x7d010c9c54c0>,
 <Element {http://www.w3.org/2005/Atom}title at 0x7d010c9c5180>,
 <Element {http://www.w3.org/2005/Atom}published at 0x7d01233d6cc0>,
 <Element {http://www.w3.org/2005/Atom}updated at 0x7d010c0d4580>,
 <Element {http://www.w3.org/2005/Atom}content at 0x7d010c0d46c0>]

In [7]: print([el.tag for el in tree.iterchildren()])    # get children of the current element (human readable)
['{http://www.w3.org/2005/Atom}id', '{http://www.whatever.com}demo_child', '{http://www.w3.org/2005/Atom}title', '{http://www.w3.org/2005/Atom}published', '{http://www.w3.org/2005/Atom}updated', '{http://www.w3.org/2005/Atom}content']

In [8]: print(tree[0] == next(tree.iterchildren()))  # children can also be accessed by index: tree[index]
True

In [9]: tree.find('id')  # returns None: the default namespace is not taken into account

In [10]: tree.find('{http://www.w3.org/2005/Atom}id')  # now it works
Out[10]: <Element {http://www.w3.org/2005/Atom}id at 0x7d010c1b06c0>

In [11]: tree.find('{http://www.w3.org/2005/Atom}demo_child')  # returns None: demo_child is not in the default namespace

In [12]: tree.find('{http://www.whatever.com}demo_child')  # use the correct namespace
Out[12]: <Element {http://www.whatever.com}demo_child at 0x7d010c9c54c0>

In [13]: tree.find(f'{{{tree.nsmap["demo"]}}}demo_child')  # look up the namespace URI in nsmap instead of spelling it out
Out[13]: <Element {http://www.whatever.com}demo_child at 0x7d010c9c54c0>

In [14]: tree.find('{http://www.w3.org/2005/Atom}content').find('articleDoc')  # follow path of elements
Out[14]: <Element articleDoc at 0x7d010c13d9c0>

In [15]: tree.xpath('//tmp_ns:id', namespaces={'tmp_ns': tree.nsmap[None]})  # use xpath, handling default namespace is tedious here
Out[15]: [<Element {http://www.w3.org/2005/Atom}id at 0x7d010c1b06c0>]

In [16]: tree.xpath('//articleDoc')  # find elements at any depth, not only direct children
Out[16]: [<Element articleDoc at 0x7d010c13d9c0>]

In [17]: tree.xpath('//@type')  # search for attribute
Out[17]: ['application/xml']

In [18]: tree.xpath('//@xml:lang')  # search for a namespaced attribute (the xml prefix is predefined)
Out[18]: ['en']
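
To answer the original question directly with lxml, the XPath can be anchored on the full path instead of `//`. A minimal sketch, assuming lxml is installed and using a trimmed copy of the question's XML; the `atom` prefix is an arbitrary name chosen here:

```python
from lxml import etree  # third-party package: pip install lxml

# Trimmed copy of the XML from the question (an assumption for brevity).
doc = b"""<?xml version="1.0" encoding="UTF-8"?>
<entry xmlns="http://www.w3.org/2005/Atom">
  <content type="application/xml">
    <articleDoc xmlns="" schemaVersion="1.8" xml:lang="en"/>
  </content>
</entry>"""

root = etree.fromstring(doc)

# XPath has no notion of a default namespace, so Atom needs an explicit prefix.
ns = {"atom": "http://www.w3.org/2005/Atom"}

# articleDoc is in no namespace (xmlns=""), and the xml prefix is predefined.
langs = root.xpath("/atom:entry/atom:content/articleDoc/@xml:lang", namespaces=ns)

# xpath() returns a list; fall back to None if nothing matched.
language = langs[0] if langs else None
print(language)
```

The input is passed as bytes because lxml rejects a `str` that carries an XML declaration with an encoding.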
Markus Dutschke