2

I have a script that adds classes to heading tags using Beautiful Soup.

#!/usr/bin/env python
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('test.html'), 'html.parser')
heading_tags = soup.find_all('h1')
for tag in heading_tags:
    tag['class'].append('new-class')
with open('test.html', 'w') as html_doc:
    html_doc.write(soup.prettify())

This works well, but I would like to preserve the whitespace in the file when writing to it. For example, this Django Template:

<div class="something">
  <div class="else">
    <h1 class="original-class">Test</h1>
      {% if request.foo == 'bar' %}
      {{ line.get_something }}
      {% else %}
      {{ line.get_something_else }}
  </div>
</div>

Becomes:

<div class="something">
 <div class="else">
  <h1 class="original-class new-class">
   Test
  </h1>
  <!-- The formatting is off here: -->
  {% if request.foo == 'bar' %}
      {{ line.get_something }}
      {% else %}
      {{ line.get_something_else }}
 </div>
</div>

I also tried using soup.encode() rather than soup.prettify(). This preserves the Django template code, but flattens the HTML structure.

Is it possible to preserve the original file's whitespace when writing to a file with Beautiful Soup?

mc_kaiser
  • 717
  • 8
  • 18

1 Answers1

1

While this is a hack, the cleanest way I found was to monkey patch BeautifulSoup.pushTag:

#!/usr/bin/env python
from bs4 import BeautifulSoup

pushTag = BeautifulSoup.pushTag
def myPushTag(self, tag):
    pushTag(self, tag)
    self.preserve_whitespace_tag_stack.append(tag)

BeautifulSoup.pushTag = myPushTag

In BeautifulSoup, pushTag appends certain tags (just pre and textarea in beautifulsoup4) to preserve_whitespace_tag_stack. This monkey patch just over-rides that behavior so that all tags end up in preserve_whitespace_tag_stack.

I'd urge caution when using this, as there may be unintended consequences.

mc_kaiser
  • 717
  • 8
  • 18