
I need to edit hundreds of .html files with BeautifulSoup 4.

My CSS formatting is lost when I write back the changes to file.

Before prettify(): [image]

After prettify(): [image]

My code:

from bs4 import BeautifulSoup
import os

path = r"C:\Files"

# Collect full paths so the files open regardless of the working directory
files = []
for file in os.listdir(path):
    if file.endswith('.html'):
        files.append(os.path.join(path, file))

for htmlfile in files:
    with open(htmlfile, encoding="utf-8") as infile:
        soup = BeautifulSoup(infile, "html.parser")

    soup.header.decompose()
    soup.menu.decompose()

    # prettify() with an encoding returns bytes, hence the binary write
    pretty_html = soup.prettify('utf-8', 'minimal')
    with open(htmlfile, "wb") as outfile:
        outfile.write(pretty_html)

If I skip prettify() and write it out as below:

with open(file, "w") as outfile:
    outfile.write(str(soup))

I get an encoding error:

outfile.write(str(soup))
File "...env\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2192' in position 2027: character maps to <undefined>

Seems to be a "utf-8" to "cp1252" encoding issue.

I can't wrap my head around this encoding stuff.
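
For reference, the traceback points at `cp1252.py`, which suggests that `open(file, "w")` is falling back to the Windows default encoding. Below is a minimal sketch of the write step with the encoding passed explicitly. It uses only the standard library; `save_markup`, the sample markup and the temporary file name are hypothetical stand-ins, and in the real script the text would be `str(soup)`.

```python
import os
import tempfile

ARROW = "\u2192"  # the character named in the traceback


def save_markup(path, markup):
    """Write text as UTF-8 instead of relying on the platform default."""
    with open(path, "w", encoding="utf-8") as outfile:
        outfile.write(markup)


# Hypothetical sample standing in for str(soup)
markup = "<p>a " + ARROW + " b</p>"

# cp1252 has no mapping for U+2192, which matches the reported error
try:
    markup.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Writing with an explicit UTF-8 encoding round-trips the character
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "page.html")
    save_markup(target, markup)
    with open(target, encoding="utf-8") as infile:
        round_trip = infile.read()
```

If this is indeed the cause, adding `encoding="utf-8"` to the text-mode `open()` call should let `str(soup)` be written without going through prettify() at all.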

  • You can make a soup instance using the right encoding that the file is in. Also, you may not need to open the file for writing in binary mode. Or are you looking to convert stored markup to `utf-8` encoding? – Oluwafemi Sule Jan 05 '18 at 02:24
  • File is in utf-8 encoding and i'm encoding in utf-8 for the soup instance as well. – joke4me Jan 05 '18 at 06:05
  • Also tried writing without opening in binary format, still get encoding error, see later part of OP. – joke4me Jan 05 '18 at 06:07
  • Alright, try calling prettify with no formatter e.g. `soup.prettify(formatter=None)`. This should keep strings from being modified . Write the output to file – Oluwafemi Sule Jan 05 '18 at 06:20
  • unfortunately that didn't help either, css formatting is still lost after writing to file – joke4me Jan 05 '18 at 06:56
  • Other than taking out the menu and the header, is it a requirement for the document to be prettified? If not, can you share a [Paste](https://pastebin.com) for one of the document for anyone to work with. – Oluwafemi Sule Jan 05 '18 at 07:00
  • An example URL: try to save this to a .html file via BeautifulSoup and the CSS formatting in the code snippets is all lost: https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/Inheritance.html – joke4me Jan 05 '18 at 09:00
  • Thanks for your help, I don't really need to prettify() I just want to edit and save the changes to automate thousands of .html – joke4me Jan 05 '18 at 09:03
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/162640/discussion-between-oluwafemi-sule-and-joke4me). – Oluwafemi Sule Jan 06 '18 at 02:17
