LXML Xpath does not seem to return full path

Question

OK, I'll be the first to admit it is returning a path, just not the path I want, and I don't know how to get it.

I'm using Python 3.3 in Eclipse with the PyDev plugin, on both Windows 7 at work and Ubuntu 13.04 at home. I'm new to Python and have limited programming experience.

I'm trying to write a script that takes in an XML Lloyd's market insurance message, finds all the tags and dumps them into a .csv file, where we can easily update them and then re-import them to create an updated XML.

I have managed all of that, except that when I collect the tags I only get each tag's own name and not the names of the tags above it.

<TechAccount Sender="broker" Receiver="insurer">
<UUId>2EF40080-F618-4FF7-833C-A34EA6A57B73</UUId>
<BrokerReference>HOY123/456</BrokerReference>
<ServiceProviderReference>2012080921401A1</ServiceProviderReference>
<CreationDate>2012-08-10</CreationDate>
<AccountTransactionType>premium</AccountTransactionType>
<GroupReference>2012080921401A1</GroupReference>
<ItemsInGroupTotal>
<Count>1</Count>
</ItemsInGroupTotal>
<ServiceProviderGroupReference>8-2012-08-10</ServiceProviderGroupReference>
<ServiceProviderGroupItemsTotal>
<Count>13</Count>
</ServiceProviderGroupItemsTotal>

That is a fragment of the XML. What I want is to find all the tags and their paths. For example, I want to show Count as ItemsInGroupTotal/Count, but I can only get it as Count.

Here is my code:

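from lxml import etree

# fullpath holds the path to the XML file
xml = etree.parse(fullpath)
# collect the elements matched by the XPath expression
all_xpath = xml.xpath('.//*')
print(all_xpath)
every_tag = []
for i in all_xpath:
    single_tag = '%s,%s' % (i.tag, i.text)
    every_tag.append(single_tag)
print(every_tag)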

This gives:

'{http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1}ServiceProviderGroupReference,8-2012-08-10', '{http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1}ServiceProviderGroupItemsTotal,\n', '{http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1}Count,13',

As you can see, Count is shown as {namespace}Count, 13 and not as {namespace}ItemsInGroupTotal/Count, 13.

Can anyone point me towards what I need?

Thanks (hope my first post is OK)

Adam

EDIT:

This is my code now:

with open(fullpath, 'rb') as xmlFilepath:
    xmlfile = xmlFilepath.read()

fulltext = '%s' % xmlfile
text = fulltext[2:]
print(text)


xml = etree.fromstring(fulltext)
tree = etree.ElementTree(xml)

every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
print(every_tag)

But this returns an error: ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

I removed the first two characters because they are b' and the parser complained that the input didn't start with a tag.

Update:

I have been playing around with this, and if I remove the xis: xxx tags and the namespace declarations at the top, it works as expected. However, I need to keep the xis tags and be able to identify them as xis tags, so I can't just delete them.

Any help on how I can achieve this?

alecxe · Answer 1 · 2013-07-10T07:21:52.483

ElementTree objects have a method getpath(element), which returns a structural, absolute XPath expression to find that element.

Calling getpath on each element in an iter() loop should work for you:

from pprint import pprint
from lxml import etree


text = """
<TechAccount Sender="broker" Receiver="insurer">
    <UUId>2EF40080-F618-4FF7-833C-A34EA6A57B73</UUId>
    <BrokerReference>HOY123/456</BrokerReference>
    <ServiceProviderReference>2012080921401A1</ServiceProviderReference>
    <CreationDate>2012-08-10</CreationDate>
    <AccountTransactionType>premium</AccountTransactionType>
    <GroupReference>2012080921401A1</GroupReference>
    <ItemsInGroupTotal>
        <Count>1</Count>
    </ItemsInGroupTotal>
    <ServiceProviderGroupReference>8-2012-08-10</ServiceProviderGroupReference>
    <ServiceProviderGroupItemsTotal>
        <Count>13</Count>
    </ServiceProviderGroupItemsTotal>
</TechAccount>
"""

xml = etree.fromstring(text)
tree = etree.ElementTree(xml)

every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
pprint(every_tag)

prints:

['/TechAccount, \n',
 '/TechAccount/UUId, 2EF40080-F618-4FF7-833C-A34EA6A57B73',
 '/TechAccount/BrokerReference, HOY123/456',
 '/TechAccount/ServiceProviderReference, 2012080921401A1',
 '/TechAccount/CreationDate, 2012-08-10',
 '/TechAccount/AccountTransactionType, premium',
 '/TechAccount/GroupReference, 2012080921401A1',
 '/TechAccount/ItemsInGroupTotal, \n',
 '/TechAccount/ItemsInGroupTotal/Count, 1',
 '/TechAccount/ServiceProviderGroupReference, 8-2012-08-10',
 '/TechAccount/ServiceProviderGroupItemsTotal, \n',
 '/TechAccount/ServiceProviderGroupItemsTotal/Count, 13']
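
Since the question mentions dumping the tags into a .csv, here is a minimal sketch of writing the same path/text pairs with the standard csv module instead of formatting them by hand. The column names and the in-memory buffer (standing in for a real file) are assumptions made for illustration, and the XML is a shortened version of the fragment above.

```python
import csv
import io

from lxml import etree

# Shortened version of the fragment above, used only for illustration.
text = b"<TechAccount><ItemsInGroupTotal><Count>1</Count></ItemsInGroupTotal></TechAccount>"
root = etree.fromstring(text)
tree = root.getroottree()

# In-memory stand-in for something like open('tags.csv', 'w', newline='').
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['path', 'text'])  # hypothetical column names
for e in root.iter(etree.Element):  # elements only, skips comments
    value = (e.text or '').strip()  # None or whitespace becomes ''
    writer.writerow([tree.getpath(e), value])

rows = buffer.getvalue().splitlines()
print(rows)
```

One reason to prefer csv.writer over '%s, %s' formatting is that it quotes values containing commas or newlines, so the file can be read back reliably.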

UPD: If your xml data is in the file test.xml, the code would look like:

from pprint import pprint
from lxml import etree

xml = etree.parse('test.xml').getroot()
tree = etree.ElementTree(xml)

every_tag = ['%s, %s' % (tree.getpath(e), e.text) for e in xml.iter()]
pprint(every_tag)
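
If you read the file yourself in binary mode, a likely fix for the "Unicode strings with encoding declaration are not supported" error is to pass the bytes straight to etree.fromstring rather than converting them to a str first. The snippet below is only a sketch with a made-up two-element document; the content of raw is an assumption.

```python
from lxml import etree

# Hypothetical document with an encoding declaration, held as bytes,
# which is what open(path, 'rb').read() returns.
raw = (b'<?xml version="1.0" encoding="UTF-8"?>'
       b'<TechAccount><UUId>abc</UUId></TechAccount>')

root = etree.fromstring(raw)  # bytes input is accepted as-is
tree = root.getroottree()
paths = [tree.getpath(e) for e in root.iter()]
print(paths)

# The same content as a str is expected to be rejected because of the
# declaration, which matches the error quoted in the question.
try:
    etree.fromstring(raw.decode('utf-8'))
except ValueError as exc:
    print('str input rejected:', exc)
```

Formatting the bytes with '%s' produces the repr of the bytes object (hence the leading b'), so it is not a valid way to decode them.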

Hope that helps.

Many thanks for this, but I'm having trouble getting it to work for me. I read the XML from a file rather than putting it directly in the text, and my attempts at converting it to a string seem to fail. Any tips on achieving this? — user2565150, Jul 10 '13 at 07:07
Sure, replace `etree.fromstring(text)` with `etree.parse(file_name)`. — alecxe, Jul 10 '13 at 07:08
Sorry should have said I tried that and got: TypeError: Argument 'element' has incorrect type (expected lxml.etree._Element, got lxml.etree._ElementTree) — user2565150, Jul 10 '13 at 07:13
Thanks for the quick responses. Your updated code works for me but outputs: '/*, \n', '/*/*, \n', '/*/*/*[1], 2EF40080-F618-4FF7-833C-A34EA6A57B73', '/*/*/*[2], HOY123/456', '/*/*/*[3], etc. Does this mean the XML is not formatted in a way that gives the paths I want? Is there a way to post the whole XML? It's 177 lines long; shall I just paste the whole thing in the question? — user2565150, Jul 10 '13 at 07:29
Yeah, `getpath` won't work with your complicated xml with namespaces. — alecxe, Jul 10 '13 at 18:27
(aside: pastebin.com is full of animated ads; out of kindness to folks not using adblockers, please consider gist.github.com, sprunge.us, ix.io, refheap.com, or otherwise something else without them). — Charles Duffy, Jul 30 '13 at 15:47

Brecht Machiels · Answer 2 · 2014-02-05T09:31:28.527

getpath() does indeed return an xpath that's not suited for human consumption. From this xpath you can build a more useful one, though, for example with this quick-and-dirty approach:

def human_xpath(element):
    full_xpath = element.getroottree().getpath(element)
    xpath = ''
    human_xpath = ''
    for i, node in enumerate(full_xpath.split('/')[1:]):
        xpath += '/' + node
        element = element.xpath(xpath)[0]
        namespace, tag = element.tag[1:].split('}', 1)
        if element.getparent() is not None:
            nsmap = {'ns': namespace}
            same_name = element.getparent().xpath('./ns:' + tag,
                                                  namespaces=nsmap)
            if len(same_name) > 1:
                tag += '[{}]'.format(same_name.index(element) + 1)
        human_xpath += '/' + tag
    return human_xpath
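
As an alternative sketch (not part of the function above), the same kind of readable path can be built by walking up through getparent() and taking the local name from etree.QName, which avoids re-evaluating XPath expressions. The function name readable_path, the xis namespace URI and the Note elements are hypothetical, chosen only to show a default namespace alongside a prefixed one.

```python
from lxml import etree


def readable_path(element):
    """Build a slash-separated path from local tag names, keeping prefixes."""
    steps = []
    node = element
    while node is not None:
        name = etree.QName(node).localname
        if node.prefix:
            # Keep prefixed tags (such as xis:...) recognisable.
            name = node.prefix + ':' + name
        parent = node.getparent()
        if parent is not None:
            # Siblings with the identical qualified tag, in document order.
            twins = list(parent.iterchildren(node.tag))
            if len(twins) > 1:
                name += '[{}]'.format(twins.index(node) + 1)
        steps.append(name)
        node = parent
    return '/' + '/'.join(reversed(steps))


# Hypothetical sample: the xis URI and the Note elements are made up.
sample = (
    b'<TechAccount xmlns="http://www.ACORD.org/standards/Jv-Ins-Reinsurance/1"'
    b' xmlns:xis="urn:example:xis">'
    b'<ItemsInGroupTotal><Count>1</Count></ItemsInGroupTotal>'
    b'<xis:Note>a</xis:Note><xis:Note>b</xis:Note>'
    b'</TechAccount>'
)
root = etree.fromstring(sample)
paths = [readable_path(e) for e in root.iter(etree.Element)]
print(paths)
```

Elements in the default namespace come out without a prefix, so only tags that carry an explicit prefix in the document are marked. The result is meant for display or for CSV keys, not as an XPath expression to evaluate.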

This is interesting. I am running into a similar issue. I opened a question, How can I browse & list XPATH of a XML Message? https://stackoverflow.com/questions/29173143/how-can-i-browse-list-xpath-of-a-xml-message How do I plug the human_xpath Python function into my posted code? Thanks for your guidance. — user2647763 - RIMD, Jul 04 '20 at 16:16
