7

I am trying to open an XML file and get values from certain tags. I have done this many times, but this particular XML file is giving me trouble. Here is a section of it:

<?xml version='1.0' encoding='UTF-8'?>
<package xmlns="http://apple.com/itunes/importer" version="film4.7">
  <provider>filmgroup</provider>
  <language>en-GB</language>
  <actor name="John Smith" display="Doe John"></actor>
</package>

And here is a sample of my Python code:

metadata = '/Users/mylaptop/Desktop/Python/metadata.xml'
from lxml import etree
parser = etree.XMLParser(remove_blank_text=True)
open(metadata)
tree = etree.parse(metadata, parser)
root = tree.getroot()
for element in root.iter(tag='provider'):
    providerValue = tree.find('//provider')
    providerValue = providerValue.text
    print providerValue
tree.write('/Users/mylaptop/Desktop/Python/metadataDone.xml', pretty_print = True, xml_declaration = True, encoding = 'UTF-8')

When I run this, it can't find the provider tag or its value. If I remove xmlns="http://apple.com/itunes/importer", everything works as expected. How can I remove this namespace (I'm not interested in it at all) so that I can get the tag values I need using lxml?

pnuts
speedyrazor

2 Answers

11

The provider tag is in the http://apple.com/itunes/importer namespace, so you either need to use the fully qualified name

{http://apple.com/itunes/importer}provider

or use one of the lxml methods that has the namespaces parameter, such as root.xpath. Then you can specify it with a namespace prefix (e.g. ns:provider):

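from lxml import etree

metadata = '/Users/mylaptop/Desktop/Python/metadata.xml'  # path from the question
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(metadata, parser)
root = tree.getroot()
# Map a prefix of our choosing to the namespace URI declared in the file.
namespaces = {'ns': 'http://apple.com/itunes/importer'}
# The union (|) returns matches in document order: provider text, then actor name.
items = iter(root.xpath('//ns:provider/text()|//ns:actor/@name',
                        namespaces=namespaces))
# zip(*[items]*2) takes two consecutive results at a time from the same iterator.
for provider, actor in zip(*[items]*2):
    print(provider, actor)

yields (on Python 3)

filmgroup John Smith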

Note that the XPath used above assumes that <provider> and <actor> elements always appear in alternation. If that is not true, then there are of course ways to handle it, but the code becomes a bit more verbose:

for package in root.xpath('//ns:package', namespaces=namespaces):
    for provider in package.xpath('ns:provider', namespaces=namespaces):
        print(provider.text)
    for actor in package.xpath('ns:actor', namespaces=namespaces):
        print(actor.attrib['name'])
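The fully qualified name form mentioned at the top of this answer can also be used directly with find and iter, without XPath. Here is a minimal sketch using the XML from the question (with the actor tag closed properly). The NS and XML constants are names chosen for this illustration, and the fallback import assumes the standard library's ElementTree accepts the same calls, which holds for the ones used here:

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: these particular calls behave the same in the standard library.
    import xml.etree.ElementTree as etree

NS = "http://apple.com/itunes/importer"

XML = b"""<?xml version='1.0' encoding='UTF-8'?>
<package xmlns="http://apple.com/itunes/importer" version="film4.7">
  <provider>filmgroup</provider>
  <language>en-GB</language>
  <actor name="John Smith" display="Doe John"></actor>
</package>"""

root = etree.fromstring(XML)

# Fully qualified ("{namespace}tag") name passed straight to find().
provider_text = root.find("{%s}provider" % NS).text

# The same notation works with iter(); attributes here are not namespaced.
actor_names = [actor.get("name") for actor in root.iter("{%s}actor" % NS)]

# Alternatively, find() accepts a prefix map, much like xpath() does.
language_text = root.find("ns:language", namespaces={"ns": NS}).text

print(provider_text, actor_names, language_text)
```

Note that the attribute lookup needs no namespace: a default xmlns declaration applies to element names only, not to unprefixed attributes.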
unutbu
  • unutbu, how would I find an attribute of a tag? I have amended my original example, so I'm looking for the value of the actor name= attribute. – speedyrazor Aug 05 '13 at 21:51
  • If you have the `element`, you can access the attribute value with `element.attrib['name']`. However, if you are scraping `provider` and `actor` elements from an XML file, you could set up a single XPath to do both at once using the `|` (or) syntax. I've edited the post to show what I mean. – unutbu Aug 05 '13 at 22:06
  • And one last question, what if I have multiple 'title' tags in my xml, how can I provide an absolute xpath to the exact title I need please? – speedyrazor Aug 06 '13 at 05:13
  • Start by going through this [XPath tutorial](http://www.w3schools.com/xpath/xpath_syntax.asp). It will show you all the basic ways to specify XPaths. – unutbu Aug 06 '13 at 13:49
  • How would I go about removing the namespace altogether, so that there is no xmlns=... at all? – speedyrazor Aug 09 '13 at 21:05
  • That is a radical change since removing namespaces can change the meaning of the XML. I would try hard to work with the namespaces, as shown above. However -- having stated this warning -- there is [a way to remove namespaces](http://stackoverflow.com/q/13591707/190597). – unutbu Aug 09 '13 at 21:17
  • Your original solution works a treat, but for another application I wish to remove all namespaces and prefixes entirely, so that my root package tag stands on its own, with no xmlns. – speedyrazor Aug 09 '13 at 21:42
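For the namespace-removal case raised in the comments above, one possible approach is to rewrite each element's tag to its local name and then drop the unused declaration. This is only a sketch under assumptions: strip_namespaces is a hypothetical helper name, it does not touch namespaced attributes, and, as warned above, discarding namespaces can change the meaning of the XML. The fallback import is there only so the snippet also runs without lxml installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library is an acceptable stand-in for this demo.
    import xml.etree.ElementTree as etree

XML = b"""<?xml version='1.0' encoding='UTF-8'?>
<package xmlns="http://apple.com/itunes/importer" version="film4.7">
  <provider>filmgroup</provider>
  <language>en-GB</language>
  <actor name="John Smith" display="Doe John"></actor>
</package>"""


def strip_namespaces(root):
    """Hypothetical helper: replace '{uri}tag' with 'tag' on every element."""
    for element in root.iter():
        # Comments and processing instructions have non-string tags; skip them.
        if isinstance(element.tag, str) and element.tag.startswith("{"):
            element.tag = element.tag.split("}", 1)[1]
    # lxml keeps the now-unused xmlns declaration until it is cleaned up;
    # the standard library has no such function and does not need one.
    cleanup = getattr(etree, "cleanup_namespaces", None)
    if cleanup is not None:
        cleanup(root)
    return root


root = strip_namespaces(etree.fromstring(XML))

# Plain tag names now match, as in the question's original code.
provider_text = root.find("provider").text
serialized = etree.tostring(root)
print(provider_text)
print(serialized.decode("utf-8"))
```

After this, the root serializes as a package element carrying only its version attribute, and the tree can be written out as in the question.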
2

My suggestion is to not ignore the namespace but, instead, to take it into account. I wrote some related functions (copied with slight modification) for my work on the django-quickbooks library. With these functions, you should be able to do this:

providers = getels(root, 'provider', ns='http://apple.com/itunes/importer')

Here are those functions:

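class TagNotFound(Exception):
    """Raised by getel when no child element matches the requested tag."""

def get_tag_with_ns(tag_name, ns):
    # Build the fully qualified '{namespace}tag' form that lxml expects.
    return '{%s}%s' % (ns, tag_name)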

def getel(elt, tag_name, ns=None):
    """ Gets the first tag that matches the specified tag_name taking into
    account the QB namespace.

    :param ns: The namespace to use if not using the default one for
    django-quickbooks.
    :type  ns: string
    """

    res = elt.find(get_tag_with_ns(tag_name, ns=ns))
    if res is None:
        raise TagNotFound('Could not find tag by name "%s"' % tag_name)
    return res

def getels(elt, *path, **kwargs):
    """ Gets all elements matching the last tag in path, after following the
    first match of each preceding tag down from elt.

    Example:
        >>> xml = (
        ... '<root xmlns="correct/namespace">'
        ...     '<item><id>1</id></item>'
        ...     '<item><id>2</id></item>'
        ... '</root>')
        >>> el = etree.fromstring(xml)
        >>> [e.tag for e in getels(el, 'item', ns='correct/namespace')]
        ['{correct/namespace}item', '{correct/namespace}item']
    """

    ns = kwargs['ns']  # ns is a required keyword argument

    # Descend through every path component except the last.
    for tag_name in path[:-1]:
        elt = getel(elt, tag_name, ns=ns)
    # Collect all children that match the final path component.
    return elt.findall(get_tag_with_ns(path[-1], ns=ns))
Josh