How to remove whitespace in BeautifulSoup

Question

I have a bunch of HTML I'm parsing with BeautifulSoup and it's been going pretty well except for one minor snag. I want to save the output into a single-lined string, with the following as my current output:

    <li><span class="plaincharacterwrap break">
                    Zazzafooky but one two three!
                </span></li>
<li><span class="plaincharacterwrap break">
                    Zazzafooky2
                </span></li>
<li><span class="plaincharacterwrap break">
                    Zazzafooky3
                </span></li>

Ideally I'd like

<li><span class="plaincharacterwrap break">Zazzafooky but one two three!</span></li><li><span class="plaincharacterwrap break">Zazzafooky2</span></li>

There's a lot of redundant whitespace that I'd like to get rid of but it's not necessarily removable using strip(), nor can I blatantly remove all the spaces because I need to retain the text. How can I do it? It seems like a common enough problem that regex would be overkill, but is that the only way?

I don't have any <pre> tags so I can be a little more forceful there.

Thanks once again!

Comment (Nov 24 '10 at 19:38): You can do what browsers do: collapse all adjacent whitespace (in text) into single spaces.
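As a rough sketch of that suggestion, the snippet below collapses every run of whitespace to a single space using only the standard library. The helper name `collapse_whitespace` is made up for illustration, and it treats its input as plain text, so it would also affect whitespace inside `<pre>` blocks if any were present (the asker says there are none).

```python
import re


def collapse_whitespace(text: str) -> str:
    """Collapse each run of spaces, tabs, and newlines to one space."""
    # \s+ matches any run of whitespace; strip() drops what is left at the ends.
    return re.sub(r"\s+", " ", text).strip()


# Text content of one <span> from the question, including its indentation.
raw = "\n                    Zazzafooky but one two three!\n                "
print(collapse_whitespace(raw))  # Zazzafooky but one two three!
```

Unlike deleting whitespace outright, this keeps words separated when a sentence wraps across several lines.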

Andrew Clark · Accepted Answer · 2012-11-05T00:34:34.673

Here is how you can do it without regular expressions:

>>> html = """    <li><span class="plaincharacterwrap break">
...                     Zazzafooky but one two three!
...                 </span></li>
... <li><span class="plaincharacterwrap break">
...                     Zazzafooky2
...                 </span></li>
... <li><span class="plaincharacterwrap break">
...                     Zazzafooky3
...                 </span></li>
... """
>>> html = "".join(line.strip() for line in html.split("\n"))
>>> html
'<li><span class="plaincharacterwrap break">Zazzafooky but one two three!</span></li><li><span class="plaincharacterwrap break">Zazzafooky2</span></li><li><span class="plaincharacterwrap break">Zazzafooky3</span></li>'
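One thing to watch for with this approach: because the stripped lines are joined with an empty string, text that wraps across several lines inside a tag gets glued together. A small sketch of that edge case, using a hypothetical helper name and made-up sample markup:

```python
def join_stripped_lines(markup: str) -> str:
    """Strip every line and concatenate the results with no separator."""
    return "".join(line.strip() for line in markup.splitlines())


# Hypothetical markup where the text itself spans two lines.
sample = """<li><span class="x">
        first line
        second line
    </span></li>"""

print(join_stripped_lines(sample))
# <li><span class="x">first linesecond line</span></li>
```

For the markup in the question, where each text node sits on a single line, this does not come up.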

score 19 · Answer 2 · edited Sep 15 '22 at 01:54


Old question, I know, but beautifulsoup4 has this helper called stripped_strings.

Try this:

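# `about` is a BeautifulSoup object or Tag obtained earlier.
description_el = about.find('p', {"class": "description"})
# stripped_strings yields each text node stripped, skipping whitespace-only ones.
descriptions = list(description_el.stripped_strings)
description = "\n\n".join(descriptions) if descriptions else ""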
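Note that `stripped_strings` returns only the text, not the tags. If the goal is the single-line markup from the question, one possible sketch is to strip the text nodes in place and then serialize the tree. The sample below is a shortened copy of the asker's HTML; the choice of `html.parser` and the loop itself are assumptions, not something taken from the answer above.

```python
from bs4 import BeautifulSoup

markup = """<li><span class="plaincharacterwrap break">
                    Zazzafooky but one two three!
                </span></li>
<li><span class="plaincharacterwrap break">
                    Zazzafooky2
                </span></li>"""

soup = BeautifulSoup(markup, "html.parser")

# Text only: tags are dropped, whitespace-only strings are skipped.
texts = list(soup.stripped_strings)
print(texts)  # ['Zazzafooky but one two three!', 'Zazzafooky2']

# Markup kept: replace each text node with its stripped form and remove
# the whitespace-only nodes that sit between tags.
for node in soup.find_all(string=True):
    stripped = node.strip()
    if stripped:
        node.replace_with(stripped)
    else:
        node.extract()

single_line = str(soup)
print(single_line)
```

Because `find_all` returns a list, modifying the tree inside the loop is safe. This version would also strip whitespace inside `<pre>` elements, which the asker says are not present.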

answered Sep 15 '13 at 13:24 by twig · edited Sep 15 '22 at 01:54 by Matthew Walker

score 2 · Answer 3 · answered Nov 24 '10 at 19:42

re.sub(r'[\ \n]{2,}', '', yourstring)

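The regex [\ \n]{2,} matches any run of two or more spaces and newlines (the backslash before the space is optional inside a character class). A more thorough implementation handles the two separately:

import re

yourstring = re.sub(r' {2,}', '', yourstring)
yourstring = re.sub(r'\n+', '', yourstring)

I expected the first pattern to replace only repeated newlines, but because the character class matches any mix of spaces and newlines, it removes the indentation too, and it seems (at least for me) to work just fine. It does leave a lone newline between two tags untouched, which the second version takes care of.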
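For comparison, here is a hedged single-pattern variant that removes each newline together with whatever whitespace surrounds it, leaving spaces between words alone. The function name and the sample string are invented for this illustration.

```python
import re


def strip_line_breaks(markup: str) -> str:
    """Remove every newline plus the whitespace directly around it."""
    # strip() handles leading indentation on the first line, which has no
    # newline in front of it and so is not matched by the pattern.
    return re.sub(r"\s*\n\s*", "", markup.strip())


sample = "    <li><span>\n        one two  three\n    </span></li>\n<li>x</li>\n"
print(strip_line_breaks(sample))
# <li><span>one two  three</span></li><li>x</li>
```

The double space inside "one two  three" survives because it is not adjacent to a newline. Collapsing those as well would need a separate pass.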

score 1 · Answer 4 · answered Aug 25 '20 at 18:00

If you came here after being tripped up by BeautifulSoup's prettify(), I think this solution won't add extra whitespace.

from lxml import html, etree

# Parse the input file, then serialize it back without pretty-printing.
with open('inputfile.html') as f:
    doc = html.fromstring(f.read())
with open('out.html', 'wb') as out:
    out.write(etree.tostring(doc))  # tostring() returns bytes

See Ian Bicking's answer on Stack Overflow.

Parsing via xml.etree is simple...

from xml.etree import ElementTree as ET
tree = ET.parse('out.html')
title = tree.find(".//title").text
print(title)
