32

Is there any way to define a custom indent width for the .prettify() function? From what I can gather from its source -

def prettify(self, encoding=None, formatter="minimal"):
    if encoding is None:
        return self.decode(True, formatter=formatter)
    else:
        return self.encode(encoding, True, formatter=formatter)

There is no way to specify indent width. I think it's because of this line in the decode_contents() function -

s.append(" " * (indent_level - 1))

Which has a fixed length of 1 space! (WHY!!) I tried specifying indent_level=4, that just results in this -

    <section>
     <article>
      <h1>
      </h1>
      <p>
      </p>
     </article>
    </section>

Which looks just plain stupid. :|

Now, I can hack around this, but I want to be sure I'm not missing anything, because this should be a basic feature. :-/

If you know a better way of prettifying HTML code, let me know.

smci
Bibhas Debnath
  • 1
    In answer to your side question ("WHY!"): HTML and XML tend to be very, very deeply nested, and I'm guessing the Crummy guys like 80-column windows. But you might want to post to the mailing list/group and/or file a bug requesting this feature (and, since the patch is pretty simple—and ramabodhi already pretty much wrote it for you—you should include it with your email/bug report). – abarnert Mar 20 '13 at 01:20
  • 2
    It looks like someone submitted a similar patch against 3.2 to the mailing list a couple years ago. See [here](https://groups.google.com/forum/?fromgroups=#!topic/beautifulsoup/B4qryJpJqpY). – abarnert Mar 20 '13 at 01:37
  • 2
    "1-space indent looks just plain stupid. :|" - Thank you. This is exactly what I was thinking when I was searching for this issue. – Brandin Aug 24 '15 at 18:37

4 Answers

27

I actually dealt with this myself, in the hackiest way possible: by post-processing the result.

import re

# Capture each line's leading whitespace so it can be doubled
r = re.compile(r'^(\s*)', re.MULTILINE)
def prettify_2space(s, encoding=None, formatter="minimal"):
    return r.sub(r'\1\1', s.prettify(encoding, formatter))

Actually, I monkeypatched prettify_2space in place of prettify in the class. That's not essential to the solution, but let's do it anyway, and make the indent width a parameter instead of hardcoding it to 2:

import re
import bs4

orig_prettify = bs4.BeautifulSoup.prettify
r = re.compile(r'^(\s*)', re.MULTILINE)
def prettify(self, encoding=None, formatter="minimal", indent_width=4):
    # Repeat each line's leading whitespace indent_width times
    return r.sub(r'\1' * indent_width, orig_prettify(self, encoding, formatter))
bs4.BeautifulSoup.prettify = prettify

So:

x = '''<section><article><h1></h1><p></p></article></section>'''
soup = bs4.BeautifulSoup(x)
print(soup.prettify(indent_width=3))

… gives:

<html>
   <body>
      <section>
         <article>
            <h1>
            </h1>
            <p>
            </p>
         </article>
      </section>
   </body>
</html>

Obviously if you want to patch Tag.prettify as well as BeautifulSoup.prettify, you have to do the same thing there. (You might want to create a generic wrapper that you can apply to both, instead of repeating yourself.) And if there are any other prettify methods, same deal.
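A generic wrapper along those lines might look like the sketch below. The names with_indent_width and FakeTag are made up for illustration; FakeTag is only a stand-in so the example runs without bs4 installed. The sketch assumes the wrapped method returns a str (no encoding argument) indented one space per level. I also swapped the pattern to `^( +)` instead of `^(\s*)`, so that only spaces are multiplied and blank lines are left alone.

```python
import functools
import re

# Matches the run of spaces at the start of each line (not newlines or tabs).
_LEADING_SPACES = re.compile(r"^( +)", re.MULTILINE)


def with_indent_width(prettify_func):
    """Wrap a prettify-style method so it accepts an indent_width keyword.

    Assumption: the wrapped method returns str indented one space per level.
    """
    @functools.wraps(prettify_func)
    def wrapper(self, *args, indent_width=1, **kwargs):
        text = prettify_func(self, *args, **kwargs)
        # Repeat each line's leading spaces indent_width times.
        return _LEADING_SPACES.sub(lambda m: m.group(1) * indent_width, text)
    return wrapper


class FakeTag:
    """Hypothetical stand-in for bs4.Tag, so the sketch runs without bs4."""

    def prettify(self):
        return "<a>\n <b>\n  text\n </b>\n</a>"


# Apply the same wrapper to every class whose prettify you want to patch.
FakeTag.prettify = with_indent_width(FakeTag.prettify)
print(FakeTag().prettify(indent_width=4))
```

With bs4 installed, the same decorator should apply as bs4.Tag.prettify = with_indent_width(bs4.Tag.prettify), though I haven't checked that against every bs4 version.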

abarnert
6

As far as I can tell, this feature is not built in, as there are a handful of solutions out there for this problem.

Assuming you are using BeautifulSoup 4, here are the solutions I came up with:

Hardcode it in. This requires minimal changes and is fine if you don't need the indent to differ between circumstances:

myTab = 4  # add this
if pretty_print:
    # space = (' ' * (indent_level - 1))
    space = (' ' * (indent_level - myTab))
    # indent_contents = indent_level + 1
    indent_contents = indent_level + myTab

One drawback of the previous solution is that text content won't be indented entirely consistently, although the result still looks decent. If you need a more flexible/consistent solution, you can modify the class itself.

Find the prettify function and modify it as such (it is located in the Tag class in element.py):

# Add the myTab keyword to the function's parameters (or whatever you want to call it) and set it to your preferred default.
def prettify(self, encoding=None, formatter="minimal", myTab=2):
    Tag.myTab = myTab  # add a reference to it in the Tag class
    if encoding is None:
        return self.decode(True, formatter=formatter)
    else:
        return self.encode(encoding, True, formatter=formatter)

And then scroll up to the decode method in the Tag class and make the following changes:

if pretty_print:
    # space = (' ' * (indent_level - 1))
    space = (' ' * (indent_level - Tag.myTab))
    # indent_contents = indent_level + 1
    indent_contents = indent_level + Tag.myTab

Then go to the decode_contents method in the Tag class and make these changes:

#s.append(" " * (indent_level - 1))
s.append(" " * (indent_level - Tag.myTab))

Now BeautifulSoup('<root><child><desc>Text</desc></child></root>').prettify(myTab=4) will return:

<root>
    <child>
        <desc>
            Text
        </desc>
    </child>
</root>

Note: there is no need to patch the BeautifulSoup class, since it inherits from Tag. Patching the Tag class is sufficient.

tryexceptcontinue
  • This should be very easy to convert into a patch against the bs4 source tree, which is handy. The OP can just make his own fork of the bzr tree and patch it, submit the patch upstream, etc. – abarnert Mar 20 '13 at 01:30
  • Thanks guys. I just couldn't believe that only one person had a problem with this in all these years and proposed a patch, and that it is still not merged. I have already modified the function to take a variable width (as I hate hard-coding things). It pretty much does what you have suggested. But the thing is, you need to provide something for `indent_level` because of this line: `pretty_print = (indent_level is not None)`. And as far as I can see, the default value of `indent_level` is `None` and there is no dynamic way to change it. <_ – Bibhas Debnath Mar 20 '13 at 06:27
5

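Beautiful Soup has output formatters. bs4.formatter.HTMLFormatter lets you specify the indent width through its indent argument (added in version 4.11.0, so older installs need upgrading first).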

>>> import bs4
>>> s = '<section><article><h1></h1><p></p></article></section>'
>>> formatter = bs4.formatter.HTMLFormatter(indent=2)
>>> print(bs4.BeautifulSoup(s, 'html.parser').prettify(formatter=formatter))
<section>
  <article>
    <h1>
    </h1>
    <p>
    </p>
  </article>
</section>

You can also use it from the command line with pyfil (e.g. to integrate with Geany's "Send Selection to" feature):

pyfil 'bs4.BeautifulSoup(stdin, "html.parser").prettify(formatter=bs4.formatter.HTMLFormatter(indent=2))'
saaj
  • I get the following error: `TypeError: __init__() got an unexpected keyword argument 'indent'`. I've searched far and wide across the internet but nothing came of it. – Alex Mandelias Oct 28 '22 at 22:50
  • 1
    You must be using an old version of the package. `indent` was added in [this commit](https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/revision/629). According to the [changelog](https://bazaar.launchpad.net/%7Eleonardr/beautifulsoup/bs4/view/head:/CHANGELOG) it's released since version 4.11.0. – saaj Oct 29 '22 at 11:15
  • Turns out `pip install` does not automatically upgrade to the latest versions if the package is already installed. `pip install -U` is needed for this job. Thanks @saaj – Alex Mandelias Oct 31 '22 at 09:47
3

Here's a way to increase indentation without touching the original functions. Create the following function:

# Add 'n' extra spaces for every existing leading space in 'text'.
# prettify() indents one space per level, so this gives 1 + n spaces per level.
def add_indent(text, n):
    sp = " " * n
    # Keep whichever line separator the text already uses
    lsep = "\r\n" if "\r" in text else "\n"
    lines = text.split(lsep)
    for i, line in enumerate(lines):
        spacediff = len(line) - len(line.lstrip())
        if spacediff:
            lines[i] = sp * spacediff + line
    return lsep.join(lines)

Then convert the text you obtained using the above function:

import bs4

x = '''<section><article><h1></h1><p></p></article></section>'''
# Passing 'html.parser' explicitly avoids the "no parser specified" warning
soup = bs4.BeautifulSoup(x, 'html.parser')
text = soup.prettify()
text = add_indent(text, 1)  # 1 extra space per level, so 2 spaces in total
print(text)
'''
Output:
<section>
  <article>
    <h1>
    </h1>
    <p>
    </p>
  </article>
</section>
'''
Apostolos