9

I am scraping a few links with BeautifulSoup; however, it seems to completely ignore <br> tags.

Here is the relevant portion of source code of the URL I am scraping:

    <h1 class="para-title">A quick brown fox jumps over<br>the lazy dog
    <span id="something">&#xe800;</span></h1>

Here is my BeautifulSoup code (relevant part only) to get the text within the h1 tag:

    soup = BeautifulSoup(page, 'html.parser')
    title_box = soup.find('h1', attrs={'class': 'para-title'})
    title = title_box.text.strip()
    print(title)

This gives the following output:

    A quick brown fox jumps overthe lazy dog

Whereas I am expecting:

    A quick brown fox jumps over the lazy dog

How can I replace the <br> with a space in my code?

mumer91

3 Answers

20

How about using .get_text() with the separator parameter?

    from bs4 import BeautifulSoup

    page = '''<h1 class="para-title">A quick brown fox jumps over<br>the lazy dog
    <span>some stuff here</span></h1>'''

    soup = BeautifulSoup(page, 'html.parser')
    title_box = soup.find('h1', attrs={'class': 'para-title'})
    title = title_box.get_text(separator=" ").strip()
    print(title)

Output:

    A quick brown fox jumps over the lazy dog
     some stuff here
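
If the leftover newline and the leading space on the second line are unwanted, get_text() also accepts strip=True, which trims each text fragment and skips empty ones before joining them with the separator. A minimal sketch, reusing the sample HTML from this answer:

```python
from bs4 import BeautifulSoup

page = '''<h1 class="para-title">A quick brown fox jumps over<br>the lazy dog
<span>some stuff here</span></h1>'''

soup = BeautifulSoup(page, 'html.parser')
title_box = soup.find('h1', attrs={'class': 'para-title'})

# strip=True trims every text fragment and drops empty ones,
# then the fragments are joined with the separator.
title = title_box.get_text(separator=' ', strip=True)
print(title)  # A quick brown fox jumps over the lazy dog some stuff here
```

Note that this puts everything on one line, so it only fits cases where the original line structure does not matter.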
chitown88
  • My apologies, the span tags do not contain any text (question edited), so this worked great for me. Thanks. – mumer91 Apr 09 '19 at 10:09
3

Using replace() on the raw HTML before parsing:

    from bs4 import BeautifulSoup

    html = '''<h1 class="para-title">A quick brown fox jumps over<br>the lazy dog
    <span>some stuff here</span></h1>'''

    html = html.replace("<br>", " ")
    soup = BeautifulSoup(html, 'html.parser')
    title_box = soup.find('h1', attrs={'class': 'para-title'})
    title = title_box.get_text().strip()
    print(title)

OUTPUT:

    A quick brown fox jumps over the lazy dog
    some stuff here
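
A plain string replace only matches the exact spelling `<br>`, so variants such as `<br/>` or `<br />` would slip through. One possible alternative, sketched here rather than taken from the answer above, is to parse first and swap each br element for a space with replace_with(). The sample HTML uses `<br/>` on purpose to show the difference:

```python
from bs4 import BeautifulSoup

html = '''<h1 class="para-title">A quick brown fox jumps over<br/>the lazy dog
<span>some stuff here</span></h1>'''

soup = BeautifulSoup(html, 'html.parser')

# Swap every <br> element in the parsed tree for a single space,
# regardless of how the tag was written in the source.
for br in soup.find_all('br'):
    br.replace_with(' ')

title_box = soup.find('h1', attrs={'class': 'para-title'})
title = title_box.get_text().strip()
print(title)
```

This keeps the newline before the span, just like the string-replace version.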

EDIT:

For the part the OP mentioned in the comments below:

    from bs4 import BeautifulSoup

    html = '''<div class="description">Planet Nine was initially proposed to explain the clustering of orbits
    Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.
    </div>'''

    html = html.replace("\n", ". ")
    soup = BeautifulSoup(html, 'html.parser')
    div_box = soup.find('div', attrs={'class': 'description'})
    div_text = div_box.get_text().strip()
    print(div_text)

OUTPUT:

Planet Nine was initially proposed to explain the clustering of orbits. Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four..
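
The output above ends in a double period because the newline before `</div>` also becomes ". ", even though that sentence already ends with one. A possible variant, not part of the original answer, is to extract the text first, split it into lines, and add a period only where one is missing. The sample text below is shortened and made up for illustration:

```python
from bs4 import BeautifulSoup

html = '''<div class="description">First sentence without a period
Second sentence, which already ends with one.
</div>'''

soup = BeautifulSoup(html, 'html.parser')
div_box = soup.find('div', attrs={'class': 'description'})

# Split the extracted text into non-empty, trimmed lines.
lines = [line.strip() for line in div_box.get_text().splitlines() if line.strip()]

# Add a period only to lines that do not already end with one, then join with spaces.
div_text = ' '.join(line if line.endswith('.') else line + '.' for line in lines)
print(div_text)
```

Working on the extracted text rather than the raw HTML also avoids touching newlines that sit between tags elsewhere in the page.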
DirtyBit
  • In a different part of my code, I have line-breaks in text (no br, just line-breaks) that I am grabbing. How can I replace the line-break with a period and a space? – mumer91 Apr 09 '19 at 10:40
  • @mumer91 Could you post a sample of it, please? – DirtyBit Apr 09 '19 at 10:40
  • Here's HTML sample and my code: https://pastebin.com/Q8AnKvJy P.S. I can only post one question per 90 mins so using pastebin. ;) – mumer91 Apr 09 '19 at 10:45
  • Thanks but I get an error TypeError: 'NoneType' object is not callable. – mumer91 Apr 09 '19 at 11:13
  • @mumer91 did you copy-paste the code I posted? It is tested and works fine. On which line do you get the error? – DirtyBit Apr 09 '19 at 11:14
  • I probably messed up the question. Can u plz have a look here for more details? https://stackoverflow.com/questions/55592384/beautifulsoup-replace-line-breaks-with-period-and-space – mumer91 Apr 09 '19 at 12:01
  • @mumer91 posted a solution there as well, see if it helps? – DirtyBit Apr 09 '19 at 12:13
0

Use the str.replace function:

    print(title.replace("<br>", " "))

Louis Saglio
  • Using `replace("<br>", " ")` on `title` won't work. You will have to use it on the raw HTML, before passing it to BeautifulSoup. – Keyur Potdar Apr 09 '19 at 10:11
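
The point of that comment can be checked with a short sketch: the string returned by .text contains no markup at all, so there is no "<br>" left for replace() to find. The sample HTML is the heading from the question, without the span:

```python
from bs4 import BeautifulSoup

page = '''<h1 class="para-title">A quick brown fox jumps over<br>the lazy dog</h1>'''

soup = BeautifulSoup(page, 'html.parser')
title = soup.find('h1', attrs={'class': 'para-title'}).text.strip()

# The extracted text holds no tags, so replace() has nothing to match.
print('<br>' in title)                      # False
print(title.replace('<br>', ' ') == title)  # True
```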