I am getting the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 587: ordinal not in range(128)

My code:

import os
from bs4 import BeautifulSoup

do = dir_with_original_files = r'C:\Users\Me\Directory'  # raw strings keep the backslashes literal
dm = dir_with_modified_files = r'C:\Users\Me\Directory\New'
for root, dirs, files in os.walk(do):
    for f in files:
        if f.endswith('~'): #you don't want to process backups
            continue
        original_file = os.path.join(root, f)
        mf = f.split('.')
        mf = ''.join(mf[:-1])+'_mod.'+mf[-1] # you can keep the same name 
                                             # if you omit the last two lines.
                                             # They are in separate directories
                                             # anyway. In that case, mf = f
        modified_file = os.path.join(dm, mf)
        with open(original_file, 'r') as orig_f, \
             open(modified_file, 'w') as modi_f:
            soup = BeautifulSoup(orig_f.read())
            for t in soup.find_all('td', class_='test'):
                t.string.wrap(soup.new_tag('h2'))
            # This is where you create your new modified file.
            modi_f.write(soup.prettify())

This code walks a directory and, for each file, finds every td with class test and wraps the text inside it in h2 tags. So before, a cell would have been:

<td class="test"> text </td>

After running this program, a new file is created containing:

<td class="test"> <h2>text</h2> </td>

Or at least, that is how I would like it to work. Unfortunately, I currently get the error described above. I believe this is because some of the text I am parsing is in Spanish and includes accented characters.

What can I do to fix my issue?

Martijn Pieters
Simon Kiely

1 Answer

soup.prettify() returns a Unicode string, but your file expects a byte string. Python tries to help here and automatically encodes the result with the default ASCII codec, but your Unicode string contains code points outside the ASCII range, so the encoding fails.
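
The implicit step can be reproduced by hand. This is a minimal sketch, assuming a made-up sample string rather than the contents of the original files: encoding it to ASCII explicitly fails in the same way, while a codec that covers the character succeeds.

```python
# Sample text is an assumption for illustration; u'\xe1' is a-acute.
text = u'<td class="test">est\xe1</td>'

try:
    # Roughly what Python 2 does behind the scenes when a unicode string
    # is written to a file opened with the built-in open().
    text.encode('ascii')
    failed = False
except UnicodeEncodeError:
    failed = True

# A codec that can represent the character works fine.
encoded = text.encode('utf-8')
```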

You'll have to either manually encode to a different codec, or use a different file object type that'll do this automatically for you.
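
For the second option, one possibility is io.open(), which takes an encoding argument and returns a file object that accepts Unicode strings and encodes them on write. The sketch below is only an illustration: the html string stands in for soup.prettify(), the temporary path is hypothetical, and UTF-8 is an assumed target encoding that may not match your original files.

```python
import io
import os
import tempfile

# Hypothetical stand-in for the result of soup.prettify().
html = u'<td class="test"><h2>est\xe1</h2></td>'

# Temporary output path, used here for demonstration only.
path = os.path.join(tempfile.mkdtemp(), 'out.html')

# The file object encodes unicode text to UTF-8 as it is written.
with io.open(path, 'w', encoding='utf-8') as modi_f:
    modi_f.write(html)

# Read the file back with the same encoding to check the round trip.
with io.open(path, 'r', encoding='utf-8') as check_f:
    round_tripped = check_f.read()
```

With this approach the write call itself stays as modi_f.write(soup.prettify()); only the way the file is opened changes.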

In this case, I'd encode to the original encoding that BeautifulSoup detected for you:

modi_f.write(soup.prettify().encode(soup.original_encoding))

The soup.original_encoding attribute holds the encoding BeautifulSoup used to decode the unmodified HTML. It is based on the encoding that the HTML itself declared, if one is available, or otherwise on an educated guess from statistical analysis of the bytes of the original data.
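
One caveat: original_encoding is None when BeautifulSoup was handed text that was already Unicode, and passing None to encode() raises a TypeError. That is not the case in the question, where the file contents are read as bytes, but a small guard can make the line more robust. The helper name encode_like_original and the UTF-8 fallback are assumptions for illustration, not part of BeautifulSoup.

```python
def encode_like_original(text, original_encoding, fallback='utf-8'):
    """Encode text with the detected encoding, or a fallback if none.

    Hypothetical helper; original_encoding would be soup.original_encoding.
    """
    return text.encode(original_encoding or fallback)


# Detected encoding available: use it.
latin = encode_like_original(u'est\xe1', 'iso-8859-1')

# No detected encoding (None): fall back to UTF-8.
default = encode_like_original(u'est\xe1', None)
```

In the question's code this would be called as modi_f.write(encode_like_original(soup.prettify(), soup.original_encoding)).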

Martijn Pieters
  • Many thanks for all your responses to my questions. I am starting to get to grips with this now! Unfortunately, when I try to use your solution here, I get this error: modi_f.write(soup.prettify().encoding(soup.original_encoding)) AttributeError: 'unicode' object has no attribute 'encoding' – Simon Kiely Dec 05 '14 at 11:29
  • @SimonKiely: mea culpa, that was a spelling mistake on my part. The method to use is [`unicode.encode()`](https://docs.python.org/2/library/stdtypes.html#str.encode). – Martijn Pieters Dec 05 '14 at 11:30