Unreachable xml feed entries

Question

I'm working on a Python application that sends a request to a phonebook search API and formats the data it receives. The entries come back as an XML feed like the example at the bottom.

I'm using feedparser to split the information.

What I'm struggling with is extracting the e-mail field. This information is contained in the tag <tel:extra type="email">.

I could only get the value of "type" for the last extra entry.

The entries before it, and the content between the tags, are unreachable.

Does anyone have experience with this kind of feed? Thank you for helping me.

API information

Python code:

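import feedparser

# xml holds the feed text returned by the API
data = feedparser.parse(xml)
entry = data.entries[0]
# Only the attributes of the last <tel:extra> element are exposed here
print(entry.tel_extra)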

XML example:

<?xml version="1.0" encoding="utf-8" ?>
    <feed xml:lang="de" xmlns="http://www.w3.org/2005/Atom" xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/" xmlns:tel="http://tel.search.ch/api/spec/result/1.0/">
      <id>https://tel.search.ch/api/04b361c38a40dc3aab2355d79f221f86/5acc2bdfc4554dfd5a4bb10424cd597e</id>
      <title type="text">tel.search.ch API Search Results</title>
      <generator version="1.0" uri="https://tel.search.ch">tel.search.ch</generator>
      <updated>2018-02-12T03:00:00Z</updated>
      <link href="https://tel.search.ch/result.html?was=nestle&amp;wo=broc&amp;private=0" rel="alternate" type="text/html" />
      <link href="http://tel.search.ch/api/?was=nestle&amp;wo=broc&amp;private=0&amp;key=04b361c38a40dc3aab2355d79f221f86" type="application/atom+xml" rel="self" />
      <openSearch:totalResults>1</openSearch:totalResults>
      <openSearch:startIndex>1</openSearch:startIndex>
      <openSearch:itemsPerPage>20</openSearch:itemsPerPage>
      <openSearch:Query role="request" searchTerms="nestle broc" startPage="1" />
      <openSearch:Image height="1" width="1" type="image/gif">https://www.search.ch/audit/CP/tel/de/api</openSearch:Image>
      <entry>
        <id>urn:uuid:ca71838ddcbb6a92</id>
        <updated>2018-02-12T03:00:00Z</updated>
        <published>2018-02-12T03:00:00Z</published>
        <title type="text">Nestlé Suisse SA</title>
        <content type="text">Nestlé Suisse SA
        Fabrique de Broc
        rue Jules Bellet 7
        1636 Broc/FR
        026 921 51 51</content>
        <tel:nopromo>*</tel:nopromo>
        <author>
          <name>tel.search.ch</name>
        </author>
        <link href="https://tel.search.ch/broc/rue-jules-bellet-7/nestle-suisse-sa" title="Details" rel="alternate" type="text/html" />
        <link href="https://tel.search.ch/vcard/Nestle-Suisse-SA.vcf?key=ca71838ddcbb6a92" type="text/x-vcard" title="VCard Download" rel="alternate" />
        <link href="https://tel.search.ch/edit/?id=ca71838ddcbb6a92" rel="edit" type="text/html" />
        <tel:pos>1</tel:pos>
        <tel:id>ca71838ddcbb6a92</tel:id>
        <tel:type>Organisation</tel:type>
        <tel:name>Nestlé Suisse SA</tel:name>
        <tel:occupation>Fabrique de Broc</tel:occupation>
        <tel:street>rue Jules Bellet</tel:street>
        <tel:streetno>7</tel:streetno>
        <tel:zip>1636</tel:zip>
        <tel:city>Broc</tel:city>
        <tel:canton>FR</tel:canton>
        <tel:country>fr</tel:country>
        <tel:category>Schokolade</tel:category>
        <tel:phone>+41269215151</tel:phone>
        <tel:extra type="Fax Service technique">+41269215154</tel:extra>
        <tel:extra type="Fax">+41269215525</tel:extra>
        <tel:extra type="Besichtigung">+41269215960</tel:extra>
        <tel:extra type="email">maisoncailler@nestle.com</tel:extra>
        <tel:extra type="website">http://www.cailler.ch</tel:extra>
        <tel:copyright>Daten: Swisscom Directories AG</tel:copyright>
      </entry>
    </feed>

You want to read the documentation. https://pythonhosted.org/feedparser/namespace-handling.html — Tomalak, Feb 12 '18 at 12:13
This could be a bug in feedparser. It looks like the ``tel:extra`` entries are not getting parsed properly. The parsed feed contains the following entry for tel:extra => 'tel_extra': {'type': 'website'}, — PaW, Feb 12 '18 at 13:48

score 1 · Accepted Answer · answered Feb 12 '18 at 14:32


You may want to check out BeautifulSoup.

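from bs4 import BeautifulSoup

# The 'xml' parser requires the lxml package to be installed
soup = BeautifulSoup(xml, 'xml')

# Find the <tel:extra> element whose type attribute is "email"
soup.find("tel:extra", attrs={"type": "email"}).text
# 'maisoncailler@nestle.com'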
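
If you would rather avoid a third-party dependency, the standard library's xml.etree.ElementTree can probably do the same job. The following is only a sketch, not a tested drop-in for the real API: it assumes the namespace URIs shown in the example feed, uses a trimmed-down sample in place of the real response, and the helper name extract_extras is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the root element of the example feed.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "tel": "http://tel.search.ch/api/spec/result/1.0/",
}

# Trimmed-down sample with the same structure as the feed in the question.
SAMPLE_XML = """<?xml version="1.0" encoding="utf-8" ?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:tel="http://tel.search.ch/api/spec/result/1.0/">
  <entry>
    <title type="text">Nestlé Suisse SA</title>
    <tel:extra type="Fax">+41269215525</tel:extra>
    <tel:extra type="email">maisoncailler@nestle.com</tel:extra>
    <tel:extra type="website">http://www.cailler.ch</tel:extra>
  </entry>
</feed>"""


def extract_extras(xml_text):
    """Return one {type: value} dict per <entry> (hypothetical helper)."""
    # Parse from bytes so the encoding declaration in the prolog is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    results = []
    for entry in root.findall("atom:entry", NS):
        extras = {}
        # Visit every <tel:extra> element, not just the last one.
        for extra in entry.findall("tel:extra", NS):
            extras[extra.get("type")] = extra.text
        results.append(extras)
    return results


extras = extract_extras(SAMPLE_XML)[0]
print(extras["email"])  # maisoncailler@nestle.com
```

Note that keying the dict on the type attribute means two extras with the same type would overwrite each other; collecting (type, value) tuples in a list instead would avoid that.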

answered Feb 12 '18 at 14:32

PaW


BeautifulSoup looks like a good solution, but it seems to have problems with the ':' character in the tag name. I found this related post, which has no answer yet: https://stackoverflow.com/questions/26626908/how-to-find-an-xml-tag-with-special-character-in-python-beautifulsoup – joel Feb 13 '18 at 09:59
My installation of bs4 was broken. Removing and reinstalling it fixed the problem, and the code above now works. – joel Feb 23 '18 at 07:43
