finding elements by attribute with lxml

Question

I need to parse an XML file to extract some data. I only need the elements that have certain attributes. Here's an example document:

<root>
    <articles>
        <article type="news">
             <content>some text</content>
        </article>
        <article type="info">
             <content>some text</content>
        </article>
        <article type="news">
             <content>some text</content>
        </article>
    </articles>
</root>

Here I would like to get only the articles with the type "news". What's the most efficient and elegant way to do this with lxml?

I tried with the find method but it's not very nice:

from lxml import etree
f = etree.parse("myfile")
root = f.getroot()
articles = root.getchildren()[0]
article_list = articles.findall('article')
for article in article_list:
    if "type" in article.keys():
        if article.attrib['type'] == 'news':
            content = article.find('content')
            content = content.text

score 95 · Accepted Answer · answered Feb 23 '11 at 15:36

You can use XPath, e.g. root.xpath("//article[@type='news']")

This XPath expression returns a list of all <article/> elements whose "type" attribute has the value "news". You can then iterate over it to do what you want, or pass it wherever.
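As a minimal sketch of that iteration (the inline sample document, its text values, and the variable names are made up for illustration), each matching element can be asked for its child text with findtext():

```python
from lxml import etree

# Sample document modelled on the one in the question (illustrative only).
root = etree.fromstring("""
<root>
    <articles>
        <article type="news"><content>first</content></article>
        <article type="info"><content>second</content></article>
        <article type="news"><content>third</content></article>
    </articles>
</root>
""")

# xpath() returns a plain list of the matching elements.
news_articles = root.xpath("//article[@type='news']")

# findtext() gives the text of the first matching child, or None if absent.
contents = [article.findtext("content") for article in news_articles]
print(contents)
```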

To get just the text content, you can extend the xpath like so:

from lxml import etree

root = etree.fromstring("""
<root>
    <articles>
        <article type="news">
             <content>some text</content>
        </article>
        <article type="info">
             <content>some text</content>
        </article>
        <article type="news">
             <content>some text</content>
        </article>
    </articles>
</root>
""")

print(root.xpath("//article[@type='news']/content/text()"))

and this will output ['some text', 'some text']. Or if you just wanted the content elements, it would be "//article[@type='news']/content" -- and so on.
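If the attribute value is only known at run time, lxml's xpath() also accepts XPath variables as keyword arguments, which avoids assembling the expression with string formatting. A small sketch, where the variable name `kind` and the sample data are assumptions made for illustration:

```python
from lxml import etree

# Compact sample document (illustrative only).
root = etree.fromstring(
    '<root><articles>'
    '<article type="news"><content>first</content></article>'
    '<article type="info"><content>second</content></article>'
    '</articles></root>'
)

# $kind is an XPath variable; its value is passed as a keyword argument.
wanted = "info"
texts = root.xpath("//article[@type=$kind]/content/text()", kind=wanted)
print(texts)
```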

score 18 · Answer 2 · edited Sep 02 '16 at 17:12


Just for reference, you can achieve the same result with findall:

from lxml import etree

root = etree.fromstring("""
<root>
    <articles>
        <article type="news">
             <content>some text</content>
        </article>
        <article type="info">
             <content>some text</content>
        </article>
        <article type="news">
             <content>some text</content>
        </article>
    </articles>
</root>
""")

articles = root.find("articles")
article_list = articles.findall("article[@type='news']/content")
for a in article_list:
    print(a.text)
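The intermediate find("articles") step is optional: the path given to findall can start with .// to search all descendants of the element it is called on. A sketch, again with a made-up sample document:

```python
from lxml import etree

# Sample document modelled on the one in the question (illustrative only).
root = etree.fromstring("""
<root>
    <articles>
        <article type="news"><content>first</content></article>
        <article type="info"><content>second</content></article>
        <article type="news"><content>third</content></article>
    </articles>
</root>
""")

# ".//" searches the whole subtree below root, so no separate find() is needed.
news_contents = root.findall(".//article[@type='news']/content")
texts = [content.text for content in news_contents]
print(texts)
```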

edited Sep 02 '16 at 17:12

Matthias Wiehl


answered Feb 02 '15 at 10:09

Kjir


How would it work if an attribute has a namespace? For instance, if the attribute `type` in the above example were something like `imx:type`, where `imx = 'https://some.namespace.imx'`? – Alex Raj Kaliamoorthy May 20 '19 at 21:34
@AlexRajKaliamoorthy in that case you can use the prefix in the path and provide `findall` with a `namespaces` argument containing a dictionary of prefix/namespace mappings, e.g. `articles.findall("article[@imx:type='news']/content", namespaces=root.nsmap)`, or you could construct it manually, like `namespaces={"imx": "https://some.namespace.imx"}` – L0tad Feb 14 '23 at 17:49
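
To make the namespaced case concrete, here is a hedged sketch. The sample document is invented, and the namespace URI is the placeholder from the comment above. The point it illustrates is that the prefix has to appear in the path itself (`@imx:type`), with the `namespaces` mapping resolving it to the URI:

```python
from lxml import etree

IMX_NS = "https://some.namespace.imx"  # placeholder URI taken from the comment

# Invented sample document with a namespaced "type" attribute.
root = etree.fromstring("""
<root xmlns:imx="https://some.namespace.imx">
    <articles>
        <article imx:type="news"><content>first</content></article>
        <article imx:type="info"><content>second</content></article>
    </articles>
</root>
""")

ns = {"imx": IMX_NS}

# Both findall() and xpath() resolve the "imx" prefix through the mapping.
via_findall = root.findall(".//article[@imx:type='news']/content", namespaces=ns)
via_xpath = root.xpath("//article[@imx:type='news']/content", namespaces=ns)

# Without a mapping, the attribute can be read using Clark notation: {uri}name.
clark_values = [a.get("{%s}type" % IMX_NS) for a in root.iter("article")]
print([e.text for e in via_findall], clark_values)
```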
