I have been learning Python for only two weeks.

I'm scraping an XML file, and one of the elements in the loop (item -> description) has HTML inside it. How can I get the text inside each p?

import requests
from bs4 import BeautifulSoup

url="https://www.milenio.com/rss"
source=requests.get(url)
soup=BeautifulSoup(source.content, features="xml")

items=soup.findAll('item')

for item in items:
  html_text=item.description
  # This returns HTML code: <p>Paragraph 1</p> <p>Paragraph 2</p>

This next line could work, but it also returns internal and external links and images, which I don't need.

desc=item.description.get_text()

So I tried looping over the description to get all the p elements, but it doesn't work.

for p in html_text.find_all('p'):
  print(p)

AttributeError: 'NoneType' object has no attribute 'find_all'

Thank you so much!

Alex Güemez

2 Answers

The issue is how bs4 processes CData (the behavior is pretty well documented, but there's no clean built-in fix for it).

You'll need to import CData from bs4, which lets you extract the CData as a string, and parse the feed with html.parser. From there, create a new bs4 object from that string so that you can call findAll on it and iterate over its contents.

from bs4 import BeautifulSoup, CData
import requests

url="https://www.milenio.com/rss"
source=requests.get(url)
soup = BeautifulSoup(source.content, 'html.parser')

items=soup.findAll('item')

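for item in items:
  html_text = item.description
  # The description's HTML is wrapped in CDATA; find that node as a string.
  findCdata = html_text.find(string=lambda node: isinstance(node, CData))
  if findCdata is None:
    continue  # No CDATA in this description, nothing to re-parse.
  # Re-parse the CDATA string as HTML so the <p> tags become searchable.
  newSoup = BeautifulSoup(findCdata, 'html.parser')
  paragraphs = newSoup.findAll('p')
  for p in paragraphs:
    print(p.get_text())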
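
The question also mentions that links and images in the description aren't wanted. Here is a minimal sketch of one way to drop them once the description's HTML is available as a string. The sample HTML and the names description_html, fragment and paragraphs are made up for illustration; they are not taken from the real feed.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the HTML string extracted from one <description>.
description_html = (
    '<p>Paragraph 1 <a href="https://example.com/a">read more</a></p>'
    '<img src="https://example.com/pic.jpg">'
    '<p>Paragraph 2</p>'
)

fragment = BeautifulSoup(description_html, "html.parser")

# Remove links and images from the tree so neither their markup
# nor their text shows up in the paragraph text.
for unwanted in fragment.find_all(["a", "img"]):
    unwanted.decompose()

# Collect the remaining text of each paragraph.
paragraphs = [p.get_text(strip=True) for p in fragment.find_all("p")]
print(paragraphs)
```

If the link text should stay and only the tag should go, unwrap() could be used on the a tags in place of decompose().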

Edit: OP needed to extract the link text and found that this was only possible inside the item loop using link = item.link.nextSibling, because the link content was jumping outside of its tag, like so: </link>http://www.... In XML tree view, this particular XML doc showed a dropdown for the link element, which is likely the cause.

To get content from other tags in the document that don't show a dropdown in XML tree view and don't contain nested CData, write the tag name in lowercase and get the text as usual:

item.pubdate.get_text() # Gets contents of the tag <pubDate>
item.author.get_text() # Gets contents of the tag <author>
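
The link and lowercase-tag behavior can be reproduced offline. This is a small sketch using a made-up item (the URL and date are placeholders, not real feed data). One possible explanation, which is my assumption and not something confirmed in this thread, is that html.parser treats link as an empty HTML element (as in a page's head), so the URL ends up as the node right after the tag.

```python
from bs4 import BeautifulSoup

# Made-up RSS item; the real feed has more fields.
sample = (
    "<item>"
    "<link>https://example.com/story</link>"
    "<pubDate>Tue, 10 Mar 2020 19:32:00 GMT</pubDate>"
    "</item>"
)
item = BeautifulSoup(sample, "html.parser").find("item")

# The link tag itself ends up empty under html.parser...
link_inside = item.link.get_text()
# ...and the URL is the text node that follows it.
link_after = item.link.next_sibling

# html.parser lowercases tag names, so <pubDate> is reached as pubdate.
pub_date = item.pubdate.get_text()

print(repr(link_inside), link_after, pub_date)
```

Here next_sibling is the newer spelling of the nextSibling attribute used in the comments.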
JHeth
  • Thank you very much for this, it works very well, but there is a problem now. With "html.parser" the link text ends up outside the tag... https://www.milenio.com/estados/coahuila-alcaldes-marchan-falta-compra-carbon-cfe || I'm trying with item.text but it doesn't work. Is there a way to get that link? Thank you so MUCH! – Alex Güemez Mar 10 '20 at 19:32
  • This code worked to get the link after the tag: `link=item.link.nextSibling` – Alex Güemez Mar 10 '20 at 20:24
  • Nice, it's strange that link is the only instance where this happens in your particular case. For instance, the `<pubDate>` tag can be grabbed with `item.pubdate.get_text()` and it stays inside its tags. It's probably related to the fact that your link elements get a dropdown in XML tree view. I'll edit the answer to include more information on this for future reference. – JHeth Mar 10 '20 at 20:54

It should look like this:

for item in items:
    html_text = item.description  # not used below

    # Don't call html_text.find_all; search the item itself instead.
    for p in item.find_all('p'):
        print(p)
Wonka