
RESOLVED: The problem had to do with the Python version; see stackoverflow.com/a/5513856/2540382

I am fiddling with htm -> txt file conversion and am having a little trouble. My project is essentially to convert the messages.htm file I downloaded of my Facebook chat history into a messages.txt file with all the <> brackets removed and formatting preserved.

The file messages.htm is parsed into the variable text.

I then run:

target = open('output.txt', 'w')
target.write(text)
target.close()

This seems to work except when I hit an invalid character, as seen in the error below. Is there a way to either:

  1. Skip the line with the invalid character while writing?

  2. Figure out where the invalid characters are and remove the corresponding character or line?

The desired outcome is to avoid having strange characters altogether if possible.

return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U000fe333' in position 37524: character maps to <undefined>
user3768533

1 Answer

target = open('output.txt', 'wb')
target.write(text.encode('ascii', 'ignore'))
target.close()

For the "errors" argument to .encode(..), 'ignore' will strip out those characters, and 'replace' will replace them with '?'.

To test this, I replaced the write line with

target.write(u"foo\U000fe333bar".encode("ascii", "ignore"))

and confirmed that output.txt contained only "foobar".
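As a rough Python 3 sketch of the difference between the two error handlers, the snippet below encodes the same test string both ways. It also lists the positions of characters outside the ASCII range, which may help with locating the offending characters. The variable names (sample, ignored, replaced, bad_positions) are only illustrative and are not from the original script.

```python
# Python 3 sketch: compare the 'ignore' and 'replace' error handlers.
sample = "foo\U000fe333bar"  # same test string as used above

ignored = sample.encode("ascii", "ignore")    # drops the unencodable character
replaced = sample.encode("ascii", "replace")  # substitutes '?' for it

# Indexes of characters that plain ASCII cannot represent
bad_positions = [i for i, ch in enumerate(sample) if ord(ch) > 127]

print(ignored)        # b'foobar'
print(replaced)       # b'foo?bar'
print(bad_positions)  # [3]
```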

UPDATE: I edited the open(.., 'w') to open(.., 'wb') to make sure this would work in Python 3 as well.
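In Python 3, writing bytes to a file opened in text mode ('w') raises "TypeError: write() argument must be str, not bytes". One possible alternative, sketched below under the assumption that text is a Python 3 str, is to keep the file in text mode and let open() handle the encoding errors itself. The temporary directory and the out_path name are placeholders for illustration only.

```python
import os
import tempfile

# Stand-in for the parsed messages.htm content (assumed to be a str)
text = "foo\U000fe333bar"

# Hypothetical output location; replace with the real path, e.g. 'output.txt'
out_path = os.path.join(tempfile.mkdtemp(), "output.txt")

# Text mode with an explicit encoding; errors='ignore' drops characters
# that cannot be encoded instead of raising UnicodeEncodeError.
with open(out_path, "w", encoding="ascii", errors="ignore") as target:
    target.write(text)

# Read the file back to check what was actually written
with open(out_path, "r", encoding="ascii") as check:
    written = check.read()

print(written)  # foobar
```

Passing encoding="utf-8" instead would keep the unusual characters rather than dropping them, if preserving them is acceptable.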

Ken Geis
  • Hmm I get this error: File "html2text.py", line 693, in wrapwrite target.write(text.encode('ascii', 'ignore')) TypeError: write() argument must be str, not bytes – user3768533 Nov 02 '15 at 14:57
  • What type is "text"? I tested it with a string and Python 2.7.10. – Ken Geis Nov 02 '15 at 15:40
  • Sorry for the delayed response, I added print(type(text)) to the code. Cmd is telling me that the type is string. C:\Users\kevin\Desktop\workspace>python html2text.py part1.htm – user3768533 Nov 07 '15 at 21:45
  • It appears to be a change between Python 2 and Python 3. http://stackoverflow.com/a/5513856/2540382 – Ken Geis Nov 08 '15 at 17:42
  • This does not work for me: f.write(towrite.encode("ascii", "ignore")) TypeError: write() argument must be str, not bytes – Arslan Ahmad khan Aug 06 '17 at 06:38