I am using urllib and BeautifulSoup to parse an XML file in Django. I can't parse the content of a description tag that contains CDATA.

My XML:

<item>
    <title>EU Confronting US Over Surveillance</title>
    <description><![CDATA[Voice of America is an international news and broadcast organization serving Central and Eastern Europe, the Caucasus, Central Asia, Russia, the Middle East and Balkan countries]]></description>
    <guid>http://www.voanews.com/content/eu-confronting-us-over-surveillance/1778928.html</guid>
</item>

The description tag is inside the item tag. In views.py:

for i in soup.findAll('item'):
    print i.description.string

If the CDATA is not there, I can parse the contents of the description tag, but I don't know how to parse this content. Please help me out. Also, how do I get the image inside a tag like this one?

<description>&lt;img src='http://static.ibnlive.in.com/ibnlive/pix/sitepix/10_2013/tony-abbott-visits-afghanistan-says-australias-war-is-over_291013013344_338x225.jpg' width='90' height='62'&gt;&lt;p&gt;"Australia's longest war" is ending and its defence forces mission in Afghanistan will be complete by 2013 end, Prime Minister Tony Abbott announced in a statement on Tuesday.&lt;/p&gt;</description>
Madanika
  • Possible duplicate of [Can CDATA sections be preserved by BeautifulSoup?](http://stackoverflow.com/questions/16426507/can-cdata-sections-be-preserved-by-beautifulsoup) – user985366 Nov 22 '16 at 10:17

1 Answer

CData can be accessed like this:

>>> import BeautifulSoup
>>> txt = '''<description><![CDATA[Voice of America is an international news and broadcast organization serving Central and Eastern Europe, the Caucasus, Central Asia, Russia, the Middle East and Balkan countries]]></description>'''
>>> soup = BeautifulSoup.BeautifulSoup(txt)
>>> for cd in soup.findAll(text=True):
...   if isinstance(cd, BeautifulSoup.CData):
...     print 'CData value: %r' % cd
...
CData value: u'Voice of America is an international news and broadcast organization serving Central and Eastern Europe, the Caucasus, Central Asia, Russia, the Middle East and Balkan countries'
>>>

Here is an edit, based on your comment, that should help:

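from bs4 import BeautifulSoup, CData
import urllib

# Python 2: fetch the feed; the URL must include the http:// scheme
source_txt = urllib.urlopen("http://voanews.com/api/epiqq")
# BeautifulSoup is imported by name from bs4, so call it directly.
# Assumption: "html.parser" keeps CDATA sections as CData objects; other parsers may not.
soup = BeautifulSoup(source_txt.read(), "html.parser")
for cd in soup.findAll(text=True):
    # Keep only the text nodes that came from CDATA sections
    if isinstance(cd, CData):
        print 'CData value: %r' % cd

Things to note:

  • The import statement. BeautifulSoup and CData are imported by name from bs4, so the constructor is called as BeautifulSoup(...), not BeautifulSoup.BeautifulSoup(...).
  • The urlopen parameter. It needs the http:// prefix.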
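
For the remaining parts of the question (looping over every item in the feed, and getting the image out of an escaped description), here is a minimal sketch in Python 3 that uses only the standard library rather than BeautifulSoup. It assumes the feed is well-formed XML. The FEED sample, the example.com image URL, and the names ImgSrcCollector and parse_items are hypothetical, made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample feed modelled on the two <description> forms in the question.
FEED = """<rss><channel>
<item>
  <title>EU Confronting US Over Surveillance</title>
  <description><![CDATA[Voice of America is an international news and broadcast organization]]></description>
</item>
<item>
  <title>Hypothetical second item</title>
  <description>&lt;img src='http://example.com/pic.jpg' width='90' height='62'&gt;&lt;p&gt;Some text.&lt;/p&gt;</description>
</item>
</channel></rss>"""


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to it."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def parse_items(xml_text):
    """Return a list of (description_text, image_urls), one entry per <item>."""
    root = ET.fromstring(xml_text)
    results = []
    for item in root.iter("item"):
        # The XML parser hands back CDATA content and unescapes &lt; / &gt;
        # entities, so both description forms arrive here as plain text.
        description = item.findtext("description", default="")
        collector = ImgSrcCollector()
        collector.feed(description)
        collector.close()
        results.append((description, collector.sources))
    return results


if __name__ == "__main__":
    for description, images in parse_items(FEED):
        print(description)
        print(images)
```

With this approach the CDATA wrapper needs no special handling, because ElementTree exposes its content as ordinary element text. For a live feed, the bytes returned by urllib.request.urlopen(url).read() could be passed to parse_items in place of FEED.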
Andy
  • I am parsing a news feed; for example, I took this URL: "http://www.voanews.com/api/epiqq". In your code, the 3rd line raises the following error: AttributeError: type object 'BeautifulSoup' has no attribute 'read'. I passed the whole URL as sorce_txt=urllib .urlopen("http://www.voanews.c/api/epiqq") b=BeautifulSoup(sorce_txt.read()), not a particular XML tag. I want to parse the whole URL, i.e. each and every description tag in it. Please suggest how to do it, Andy. – Madanika Oct 29 '13 at 05:07
  • Yes, your code is correct, Andy. I need to know how to do it for the whole URL. – Madanika Oct 29 '13 at 05:18
  • I am trying it and will tell you if I get any error. Please guide me then, Andy. – Madanika Oct 29 '13 at 05:24
  • >>> import urllib >>> source_txt=urllib.urlopen("http://www.voanews.com/api/epiqq") >>> from bs4 import BeautifulSoup >>> b=BeautifulSoup(source_txt.read()) >>> for i in b.findAll('item'): ... for j in i.findAll('description'): ... for k in j.findAll(text=True): ... print i. I didn't get anything; nothing gets printed. Please help me. – Madanika Oct 29 '13 at 05:48
  • Yes Andy, you are correct. If it is a single tag, we can define it inside a string and parse it as you already said. But I am parsing a URL, and the XML file at this URL contains many tags like this, so I can't give it as a string. In that case, how can I parse it? I will check it and tell you. – Madanika Oct 29 '13 at 14:04
  • You will need to loop over the `soup` value for each `` – Andy Oct 29 '13 at 14:09
  • >>> soup=BeautifulSoup.BeautifulSoup(source_txt.read()) Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: type object 'BeautifulSoup' has no attribute 'BeautifulSoup'. I get this error. My import is: from bs4 import BeautifulSoup – Madanika Oct 29 '13 at 14:11
  • from bs4 import BeautifulSoup – Madanika Oct 29 '13 at 14:17