13

I have a bunch of HTML I'm parsing with BeautifulSoup and it's been going pretty well except for one minor snag. I want to save the output as a single-line string. This is my current output:

    <li><span class="plaincharacterwrap break">
                    Zazzafooky but one two three!
                </span></li>
<li><span class="plaincharacterwrap break">
                    Zazzafooky2
                </span></li>
<li><span class="plaincharacterwrap break">
                    Zazzafooky3
                </span></li>

Ideally I'd like

<li><span class="plaincharacterwrap break">Zazzafooky but one two three!</span></li><li><span class="plaincharacterwrap break">Zazzafooky2</span></li>

There's a lot of redundant whitespace that I'd like to get rid of but it's not necessarily removable using strip(), nor can I blatantly remove all the spaces because I need to retain the text. How can I do it? It seems like a common enough problem that regex would be overkill, but is that the only way?

I don't have any <pre> tags so I can be a little more forceful there.

Thanks once again!

Rio

4 Answers

20

Here is how you can do it without regular expressions:

>>> html = """    <li><span class="plaincharacterwrap break">
...                     Zazzafooky but one two three!
...                 </span></li>
... <li><span class="plaincharacterwrap break">
...                     Zazzafooky2
...                 </span></li>
... <li><span class="plaincharacterwrap break">
...                     Zazzafooky3
...                 </span></li>
... """
>>> html = "".join(line.strip() for line in html.split("\n"))
>>> html
'<li><span class="plaincharacterwrap break">Zazzafooky but one two three!</span></li><li><span class="plaincharacterwrap break">Zazzafooky2</span></li><li><span class="plaincharacterwrap break">Zazzafooky3</span></li>'
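The same idea can be wrapped in a small function. This is only a sketch: the name join_stripped_lines and the sample string are made up for illustration, and it uses splitlines() instead of split("\n") so that "\r\n" line endings are handled too.

```python
def join_stripped_lines(markup: str) -> str:
    """Strip every line and concatenate the results with no separator."""
    return "".join(line.strip() for line in markup.splitlines())


# Shortened copy of the question's output, used only as sample input.
sample = (
    '    <li><span class="plaincharacterwrap break">\n'
    "                    Zazzafooky but one two three!\n"
    "                </span></li>\n"
    '<li><span class="plaincharacterwrap break">\n'
    "                    Zazzafooky2\n"
    "                </span></li>\n"
)

single_line = join_stripped_lines(sample)
print(single_line)
```

Because the lines are joined with an empty string, text that wraps across two lines gets glued together ("one\n  two" becomes "onetwo"), so this fits markup like the above where each text run sits on its own line.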
Andrew Clark
19

Old question, I know, but beautifulsoup4 has this helper called stripped_strings.

Try this:

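# `about` is assumed to be an already-parsed BeautifulSoup tag or document
description_el = about.find('p', {"class": "description"})
# stripped_strings yields each text node with surrounding whitespace removed
descriptions = list(description_el.stripped_strings)
description = "\n\n".join(descriptions)  # empty string when there is no text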
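For a self-contained look at what stripped_strings returns, here is a small sketch run against a shortened copy of the question's markup. It assumes beautifulsoup4 is installed and uses the built-in "html.parser" backend; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's output, used only as sample input.
markup = (
    '    <li><span class="plaincharacterwrap break">\n'
    "                    Zazzafooky but one two three!\n"
    "                </span></li>\n"
    '<li><span class="plaincharacterwrap break">\n'
    "                    Zazzafooky2\n"
    "                </span></li>\n"
)

soup = BeautifulSoup(markup, "html.parser")

# Whitespace-only strings are skipped; the rest are stripped at both ends.
texts = list(soup.stripped_strings)
print(texts)
```

Note that this gives you the text only, not the surrounding tags, so it helps when you want the cleaned strings rather than a single line of markup.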
Matthew Walker
twig
2
re.sub(r'[ \n]{2,}', '', yourstring)

The regex [ \n]{2,} matches any run of two or more spaces and newlines (the space does not need to be escaped unless you compile with re.VERBOSE). A more thorough implementation is this:

yourstring = re.sub(r' {2,}', '', yourstring)
yourstring = re.sub(r'\n+', '', yourstring)

The first version worked fine for me, but it leaves a lone newline or a lone space in place (for example the single newline between </li> and <li> in the question's output), whereas the two-step version removes every newline.
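To make the difference between the one-pass and two-step patterns concrete, here is a small sketch. The function name collapse_whitespace and the sample string are made up for illustration.

```python
import re


def collapse_whitespace(markup: str) -> str:
    """Drop runs of two or more spaces, then drop every newline."""
    markup = re.sub(r" {2,}", "", markup)
    return re.sub(r"\n+", "", markup)


# Shortened copy of the question's output, used only as sample input.
sample = (
    '    <li><span class="plaincharacterwrap break">\n'
    "                    Zazzafooky but one two three!\n"
    "                </span></li>\n"
    '<li><span class="plaincharacterwrap break">\n'
    "                    Zazzafooky2\n"
    "                </span></li>\n"
)

# One pass: only runs of two or more spaces/newlines are removed,
# so the lone newline between </li> and <li> survives.
one_pass = re.sub(r"[ \n]{2,}", "", sample)

# Two steps: indentation goes first, then all newlines.
two_step = collapse_whitespace(sample)
print(two_step)
```

Single spaces inside the text ("but one two three!") are untouched by both variants, which is what keeps the content readable.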

Rafe Kettler
1

In case you came here after being troubled by BeautifulSoup's prettify(): I think this solution won't add extra spaces.

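from lxml import html, etree

# Parse the input file, then serialize it without pretty-printing
with open('inputfile.html') as f:
    doc = html.fromstring(f.read())
with open('out.html', 'wb') as out:
    out.write(etree.tostring(doc))  # tostring() returns bytes, hence 'wb'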

Please see Ian Bicking's answer on Stack Overflow.

Parsing via xml.etree is simple...

from xml.etree import ElementTree as ET
tree = ET.parse('out.html')
title = tree.find(".//title").text
print(title)
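The same lookup can be tried without any files. This is a sketch on a made-up, well-formed document string; keep in mind that xml.etree only accepts well-formed XML, so HTML with unclosed tags will raise a ParseError.

```python
from xml.etree import ElementTree as ET

# Hypothetical well-formed document, used only for illustration.
document = (
    "<html><head><title>Example</title></head>"
    "<body><p>Hi</p></body></html>"
)

root = ET.fromstring(document)
# ".//title" searches all descendants of the root for a <title> element.
title = root.find(".//title").text
print(title)
```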
Pradeep Singh