Parsing the html of the child element [BeautifulSoup]

Question

I have only been learning Python for two weeks.

I'm scraping an XML file, and one of the elements in the loop [item->description] has HTML inside it. How can I get the text inside each p?

import requests
from bs4 import BeautifulSoup

url="https://www.milenio.com/rss"
source=requests.get(url)
soup=BeautifulSoup(source.content, features="xml")

items=soup.findAll('item')

for item in items:
  html_text=item.description
  # This returns HTML code: <p>Paragraph 1</p> <p>Paragraph 2</p>

This next line could work, but the result also includes content from internal links, external links and images, which I don't need.

desc=item.description.get_text()

So I tried a loop to get every p, but it doesn't work.

for p in html_text.find_all('p'):
  print(p)

AttributeError: 'NoneType' object has no attribute 'find_all'

Thank you so much!

Use this SO link: https://stackoverflow.com/questions/2032172/how-can-i-grab-cdata-out-of-beautifulsoup — jose_bacoy, Mar 10 '20 at 13:49

JHeth · Answer 1 · 2020-03-10T21:00:37.953

The issue is how bs4 processes CData (the behavior is pretty well documented, but there is no clean built-in fix).

You'll need to import CData from bs4, which helps extract the CData section as a string, and parse the feed with html.parser. From there, create a new bs4 object from that string so that you can call findAll on it and iterate over its contents.

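from bs4 import BeautifulSoup, CData
import requests

url="https://www.milenio.com/rss"
source=requests.get(url)
soup = BeautifulSoup(source.content, 'html.parser')

items=soup.findAll('item')

for item in items:
  html_text = item.description
  if html_text is None:
    continue  # this item has no <description>
  # Find the CData node inside <description>
  findCdata = html_text.find(text=lambda tag: isinstance(tag, CData))
  if findCdata is None:
    continue  # no CData in this description, nothing to parse
  # Parse the CData string as HTML so its <p> tags become searchable
  newSoup = BeautifulSoup(findCdata, 'html.parser')
  paragraphs = newSoup.findAll('p')
  for p in paragraphs:
    print(p.get_text())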
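
For anyone who wants to try the CData step without hitting the live feed, here is a minimal offline sketch of the same idea. The sample markup and the helper name description_paragraphs are made up for illustration and are not taken from the real feed; it also uses string=, the newer spelling of the text= argument.

```python
from bs4 import BeautifulSoup, CData

# Hypothetical stand-in for the feed body (the real one comes from requests.get(url).content)
SAMPLE_RSS = """<rss><channel>
<item>
<description><![CDATA[<p>Paragraph 1</p> <img src="a.png"/> <p>Paragraph 2 <a href="https://example.com">link</a></p>]]></description>
</item>
</channel></rss>"""


def description_paragraphs(item):
    """Return the text of every <p> inside the CData of an item's <description>."""
    description = item.find("description")
    if description is None:
        return []
    # The CData section is a single string node, so look it up by type
    cdata = description.find(string=lambda s: isinstance(s, CData))
    if cdata is None:
        return []
    # Re-parse the CData string as HTML so the <p> tags become real tags
    inner = BeautifulSoup(str(cdata), "html.parser")
    return [p.get_text() for p in inner.find_all("p")]


soup = BeautifulSoup(SAMPLE_RSS, "html.parser")
paragraphs = [description_paragraphs(item) for item in soup.find_all("item")]
print(paragraphs)
```

Note that get_text() on a p still includes the text of any a tags nested inside it; the img outside the paragraphs is skipped.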

Edit: OP also needed to extract the link text and found that this was only possible inside the item loop using link = item.link.nextSibling, because the link content was jumping outside of its tag like so: </link>http://www.... In XML tree view this particular XML doc showed a dropdown for the link element, which is likely related. It is also worth knowing that link is an empty (void) element in HTML, so an HTML parser closes it straight away and the URL ends up as the text node that follows it.

To get content from other tags in the document that don't show a dropdown in XML tree view and don't contain nested CData, write the tag name in lowercase (html.parser lowercases tag names) and read the text as usual:
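
The link behavior can be reproduced offline. This is a small sketch on a made-up item (the URL is a placeholder), assuming the document is parsed with html.parser as in the answer; html.parser treats link as an empty HTML element, so the URL becomes the sibling that follows the tag.

```python
from bs4 import BeautifulSoup

# Hypothetical item; the URL is a placeholder, not one from the real feed
SAMPLE_ITEM = "<item><link>https://example.com/story</link></item>"

item = BeautifulSoup(SAMPLE_ITEM, "html.parser").find("item")

# The <link> tag itself ends up empty
link_inside = item.link.get_text()

# The URL is the text node right after the tag (next_sibling is the newer name for nextSibling)
link_after = str(item.link.next_sibling).strip()

print(repr(link_inside), link_after)
```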

item.pubdate.get_text() # Gets the contents of the tag <pubDate>
item.author.get_text() # Gets the contents of the tag <author>
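
A quick way to see why the lowercase name is needed, again on a made-up item and assuming html.parser: the parser stores tag names in lowercase, so a lookup with the original camelCase name finds nothing.

```python
from bs4 import BeautifulSoup

# Hypothetical item with a camelCase tag, as in a typical RSS feed
SAMPLE_ITEM = (
    "<item><pubDate>Tue, 10 Mar 2020 12:00:00 GMT</pubDate>"
    "<author>Example Author</author></item>"
)

item = BeautifulSoup(SAMPLE_ITEM, "html.parser").find("item")

# html.parser lowercases tag names while building the tree
tag_names = [tag.name for tag in item.find_all(True)]

# Lowercase access works; the camelCase name does not match any tag
pub_date = item.pubdate.get_text()
camel_case_lookup = item.find("pubDate")

print(tag_names, pub_date, camel_case_lookup)
```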

Thank you very much for this, it works very well, but there is a problem now. With "html.parser" the link text ends up outside the tag... https://www.milenio.com/estados/coahuila-alcaldes-marchan-falta-compra-carbon-cfe || I tried item.text but it doesn't work. Is there a way to get that link? Thank you so MUCH! - Alex Güemez, Mar 10 '20 at 19:32
This code worked to get the link after the tag: `link=item.link.nextSibling` - Alex Güemez, Mar 10 '20 at 20:24
Nice, it's strange that link is the only instance where this happens in your particular case. For instance, the `<pubDate>` tag can be grabbed with `item.pubdate.get_text()` and its text stays inside its tags. It's probably related to the fact that your link elements get a dropdown in XML tree view. I'll edit the answer to include more information on this for future reference. - JHeth, Mar 10 '20 at 20:54

score 0 · Answer 2 · answered Mar 10 '20 at 09:47

It should look like this:

for item in items:
    html_text=item.description #??

    #!! don't use html_text.find_all !!
    for p in item.find_all('p'):
        print(p)

Wonka