I'm querying a European Space Agency API (the result can be viewed here) for satellite image metadata, which I want to parse into Python objects.

Using the requests library I can successfully get the result in XML format and then read the content with lxml. I am able to find the elements and explore the tree as expected:

# parse the response; fromstring() returns the root element directly,
# so there is no getroot() call to make
root = etree.fromstring(response.content)
ns = root.nsmap

# get the first entry element and its summary
e = root.find('entry',ns)
summary = e.find('summary',ns).text

print summary

>> 'Date: 2018-11-28T09:10:56.879Z, Instrument: OLCI, Mode: , Satellite: Sentinel-3, Size: 713.99 MB'

The entry element has several date children with different values of the name attribute:

for d in e.findall('date',ns):
    print d.tag, d.attrib

>> {http://www.w3.org/2005/Atom}date {'name': 'creationdate'}
>> {http://www.w3.org/2005/Atom}date {'name': 'beginposition'}
>> {http://www.w3.org/2005/Atom}date {'name': 'endposition'}
>> {http://www.w3.org/2005/Atom}date {'name': 'ingestiondate'}

I want to grab the beginposition date element using the XPath predicate syntax [@attrib='value'], but find() just returns None. Even searching for any date element that has a name attribute ([@attrib]) returns None:

dt_begin = e.find('date[@name="beginposition"]',ns) # dt_begin is None
dt_begin = e.find('date[@name]',ns)                 # dt_begin is None

The entry element includes other children that exhibit the same behaviour, e.g. multiple str elements that also have differing name attributes.

Has anyone encountered anything similar, or is there something I'm missing? I'm using Python 2.7.14 with lxml 4.2.4.

Ali
  • When trying to access the resource linked in the question, I am prompted for username and password. Please provide a [mcve]. – mzjn Nov 28 '18 at 16:34
  • @mzjn I've added a link to the result on pastebin – Ali Nov 28 '18 at 16:43

1 Answer

It looks like an explicit prefix is needed when a predicate ([@name="beginposition"]) is used. Here is a test program:

from lxml import etree

print etree.LXML_VERSION

tree = etree.parse("data.xml")  

ns1 = tree.getroot().nsmap
print ns1
print tree.find('entry', ns1)
print tree.find('entry/date', ns1)
print tree.find('entry/date[@name="beginposition"]', ns1)

ns2 = {"atom": 'http://www.w3.org/2005/Atom'}
print tree.find('atom:entry', ns2)
print tree.find('atom:entry/atom:date', ns2)
print tree.find('atom:entry/atom:date[@name="beginposition"]', ns2)

Output:

(4, 2, 5, 0)
{None: 'http://www.w3.org/2005/Atom', 'opensearch': 'http://a9.com/-/spec/opensearch/1.1/'}
<Element {http://www.w3.org/2005/Atom}entry at 0x7f8987750b90>
<Element {http://www.w3.org/2005/Atom}date at 0x7f89877503f8>
None
<Element {http://www.w3.org/2005/Atom}entry at 0x7f8987750098>
<Element {http://www.w3.org/2005/Atom}date at 0x7f898774a950>
<Element {http://www.w3.org/2005/Atom}date at 0x7f898774a7a0>
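
If choosing a prefix feels awkward, another option that seems to work is to spell out the namespace URI inline using Clark notation ({uri}tag), which needs no namespace map at all. The following is only a minimal sketch: the SAMPLE document is a made-up, trimmed-down stand-in for data.xml, and the fallback import assumes the standard library's ElementTree accepts the same path syntax for this case.

```python
try:
    from lxml import etree  # the library used in the test program above
except ImportError:
    # Assumption: the stdlib ElementTree handles this Clark-notation path the same way
    import xml.etree.ElementTree as etree

ATOM = "http://www.w3.org/2005/Atom"

# Hypothetical stand-in for the real feed; element values are invented for illustration
SAMPLE = b"""<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <date name="creationdate">2018-11-28T10:00:00Z</date>
    <date name="beginposition">2018-11-28T09:10:56.879Z</date>
  </entry>
</feed>"""

root = etree.fromstring(SAMPLE)

# Every element name carries its namespace URI in braces, so no prefix map is passed
path = "{%s}entry/{%s}date[@name='beginposition']" % (ATOM, ATOM)
begin = root.find(path)

print(begin.text)
```

The paths get longer this way, but the lookup no longer depends on how the namespace map is interpreted.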
mzjn
  • Thanks for this concise example. I see now that using predicates requires explicit namespace prefixes, including one for the default namespace. On the other hand, if no predicate is used in find/findall, the default namespace does not have to be stated explicitly. – Ali Nov 29 '18 at 10:51
  • I am still a bit confused, to be honest... I don't really understand why it works like this. When using `xpath()` instead of `find()` or `findall()`, a prefix is always required (with or without a predicate). – mzjn Nov 29 '18 at 11:42
  • Yeah, I do find it strange. I've not been able to find any documentation that explains clearly why there are these different behaviours. – Ali Nov 29 '18 at 12:03
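
The xpath() behaviour mentioned in the comments can be checked with a small sketch. XPath 1.0 has no notion of a default namespace for element names, so an unprefixed name in xpath() only matches elements in no namespace. The sketch below is not taken from the thread: SAMPLE is a made-up stand-in for the feed, and the prefix "atom" is an arbitrary choice.

```python
from lxml import etree

# Hypothetical stand-in for the real feed; element values are invented for illustration
SAMPLE = b"""<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
  <entry>
    <date name="creationdate">2018-11-28T10:00:00Z</date>
    <date name="beginposition">2018-11-28T09:10:56.879Z</date>
  </entry>
</feed>"""

root = etree.fromstring(SAMPLE)

# Build a prefix map for xpath(): drop the None key from nsmap and
# bind the default namespace to a prefix of our choosing ("atom")
ns = {prefix: uri for prefix, uri in root.nsmap.items() if prefix is not None}
ns["atom"] = root.nsmap[None]

# Prefixed names match, with or without a predicate
without_predicate = root.xpath("atom:entry/atom:date", namespaces=ns)
with_predicate = root.xpath("atom:entry/atom:date[@name='beginposition']", namespaces=ns)

# Unprefixed names refer to elements in no namespace, so nothing matches here
unprefixed = root.xpath("entry/date", namespaces=ns)

print(len(without_predicate))
print(with_predicate[0].text)
print(unprefixed)
```

The same prefix map can also be passed to find() and findall(), which keeps one consistent set of paths across all three methods.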