I'm working on a Python 3 Tkinter app (OS is Windows 10) whose overall functionality includes the following details:

  1. Reading a number of text files which may contain data in ascii, cp1252, utf-8, or any other encoding

  2. Showing the contents of any of those files in a "preview window" (Tkinter Label widget).

  3. Writing the file contents to a single output file (opening to append each time)

For #1: I've made the file read encoding-agnostic simply by opening and reading the files in binary mode. To convert the data to a string I loop through a list of 'likely' encodings and try each in turn (with errors='strict') until one decodes without throwing an exception. This is working.
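A minimal sketch of that decode loop, for readers who want something concrete. The helper name `decode_bytes`, the candidate list, and its order are assumptions for illustration, not the code from the real application. The byte strings are the sample inputs from the test script in this question.

```python
# Hypothetical helper: the candidate list and its order are assumptions.
# cp1252 accepts almost any byte sequence, so it is tried last.
CANDIDATE_ENCODINGS = ('ascii', 'utf-8-sig', 'cp1252')


def decode_bytes(raw, encodings=CANDIDATE_ENCODINGS):
    """Return (text, encoding_name) for the first codec that decodes strictly."""
    for name in encodings:
        try:
            return raw.decode(name, 'strict'), name
        except UnicodeDecodeError:
            continue  # this codec rejected the bytes, try the next one
    raise ValueError('none of the candidate encodings matched')


text1, enc1 = decode_bytes(b'\x41\x7a\x35')
text2, enc2 = decode_bytes(b'\xe0\xbf\xf6')
text3, enc3 = decode_bytes(b'\xef\xbb\xbf\xe6\x9c\xa8\xe5\x85\xb0\xe8\xbe\x9e')
```

Using `utf-8-sig` rather than `utf_8` as a candidate means a leading byte order mark is dropped during decoding instead of ending up as the first character of the string.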

For #2: Once I've got the decoded string I just call the set() method for the Tkinter Label's textvariable. This is also working.

For #3: I'm opening an output file in the usual way and calling the write() method to write the decoded string. This works when the string was decoded as ascii or cp1252, but when it's decoded as utf-8 it throws an exception:

'charmap' codec can't encode characters in position 0-3: character maps to <undefined>

I've searched around and found fairly similar questions but nothing that seems to address this specific problem. Some further complications that restrict the solutions that will work for me:

A. I can sidestep the problem just by leaving the read-in data as bytes and opening/writing the output file as binary, but this renders some of the input file contents unreadable.

B. Although this app is mainly intended for Python 3, I'm trying to make it cross-compatible with Python 2 -- we have some slow/late adopters who will be using it. (BTW, when I run the app on Python 2 it also throws exceptions but does so for both the cp1252 data and the utf-8 data.)


For the sake of illustrating the issue, I've created this stripped-down test script. (My real application is a much larger project, and it's also proprietary to my company -- so it's not getting posted publicly!)

import tkinter as tk
import codecs

#Root window
root = tk.Tk()

#Widgets
ctrlViewFile1 = tk.StringVar()
ctrlViewFile2 = tk.StringVar()
ctrlViewFile3 = tk.StringVar()
lblViewFile1 = tk.Label(root, relief=tk.SUNKEN,
                        justify=tk.LEFT, anchor=tk.NW,
                        width=10, height=3,
                        textvariable=ctrlViewFile1)
lblViewFile2 = tk.Label(root, relief=tk.SUNKEN,
                        justify=tk.LEFT, anchor=tk.NW,
                        width=10, height=3,
                        textvariable=ctrlViewFile2)
lblViewFile3  = tk.Label(root, relief=tk.SUNKEN,
                         justify=tk.LEFT, anchor=tk.NW,
                         width=10, height=3,
                         textvariable=ctrlViewFile3)

#Layout
lblViewFile1.grid(row=0,column=0,padx=5,pady=5)
lblViewFile2.grid(row=1,column=0,padx=5,pady=5)
lblViewFile3.grid(row=2,column=0,padx=5,pady=5)

#Bytes read from "files" (ascii Az5, cp1252 European letters/punctuation, utf-8 Mandarin characters)
inBytes1 = b'\x41\x7a\x35'
inBytes2 = b'\xe0\xbf\xf6'
inBytes3 = b'\xef\xbb\xbf\xe6\x9c\xa8\xe5\x85\xb0\xe8\xbe\x9e'

#Decode
outString1 = codecs.decode(inBytes1,'ascii','strict')
outString2 = codecs.decode(inBytes2,'cp1252','strict')
outString3 = codecs.decode(inBytes3,'utf_8','strict')

#Assign stringvars
ctrlViewFile1.set(outString1)
ctrlViewFile2.set(outString2)
ctrlViewFile3.set(outString3)

#Write output files
try:
    with open('out1.txt','w') as outFile:
        outFile.write(outString1)
except Exception as e:
    print(inBytes1)
    print(str(e))

try:
    with open('out2.txt','w') as outFile:
        outFile.write(outString2)
except Exception as e:
    print(inBytes2)
    print(str(e))

try:
    with open('out3.txt','w') as outFile:
        outFile.write(outString3)
except Exception as e:
    print(inBytes3)
    print(str(e))

#Start GUI
tk.mainloop()
JDM
  • If you read bytes and write bytes, you should have an exact copy of the original file. When the output looks corrupted, doesn't the input look corrupted as well? On Windows, it might well be that your editor doesn't recognise UTF-8 and tries to interpret the bytes as CP-1252 characters. – lenz Feb 22 '19 at 23:04
  • For writing Py2/3 cross-compatible code, have a look at http://python-future.org/. Note: you should replace the `open(..., 'w')` calls with `io.open(..., 'w', encoding=...)` to achieve both Py2/3 and cross-platform compatibility. – lenz Feb 22 '19 at 23:11
  • @lenz, the output file has not only the contents of the various files but also lines of 'fixed' text that the app inserts. But you still make a good point, it may be that the editor (Notepad) just doesn't work & play well with multiple encodings within the same file. I'll check into that. – JDM Feb 23 '19 at 20:33
  • @lenz, looks like the `io` module is the way to do it. As Mark Tolonen mentioned below, explicit coding to UTF-8 fixes the `write()` problem and the `io` module supports that for 2 & 3 both. Go ahead and make it an 'official' answer, I'll accept it. – JDM Feb 23 '19 at 20:46

2 Answers

I understand you want two things:

  • a way to write arbitrary Unicode characters to a file, and
  • Python 2/3 compatibility.

Using open('out1.txt','w') violates both:

  • The output text stream is opened with a default encoding, which happens to be CP-1252 on your platform (Windows). This codec supports only a small subset of Unicode, e.g. it lacks all emojis and all Chinese characters.
  • The open function differs considerably between Python versions. In Python 3, it is the io.open function, which offers a lot of flexibility, such as specifying a text encoding. In Python 2, the returned file handle processes 8-bit strings rather than Unicode strings (text).
  • There's also a portability issue of which you might not be aware: the default encoding for IO is platform dependent, i.e. people running your code might see a different default depending on OS and localisation.
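To make the points above concrete, here is a small sketch, assuming Python 3. It looks up the locale's preferred encoding, which is what text-mode `open()` usually falls back to when no encoding is given, and then reproduces the error from the question by encoding to CP-1252 explicitly. The variable names are illustrative only. The string mirrors what decoding the question's third sample with `utf_8` produces: that codec keeps the byte order mark as the character U+FEFF, which is why the reported range covers four characters.

```python
import locale

# Encoding that text-mode open() normally uses when none is specified.
# The value depends on OS and locale settings, so it is not asserted here.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

# U+FEFF (the BOM kept by the plain utf_8 codec) plus three Chinese characters.
text = u'\ufeff\u6728\u5170\u8f9e'

try:
    text.encode('cp1252')
except UnicodeEncodeError as exc:
    # start/end give the half-open range of characters that cannot be encoded.
    failed_range = (exc.start, exc.end)
    error_message = str(exc)
    print(error_message)
```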

You can avoid all this with io.open('out1.txt', 'w', encoding='utf8'):

  • Use an encoding that supports all characters needed. Using the detected input encoding should work, unless processing introduces characters outside the supported range. Using one of the UTF codecs will always work, with UTF-8 being the most widely used for text files. Note that some Windows apps (like Notepad) tend not to recognize UTF-8 without a signature. The utf-8-sig codec writes UTF-8 with a BOM, which makes such apps recognize the file as UTF-8. When used for reading, that codec also removes the UTF-8 BOM from the input stream if present.
  • The io module was backported to Python 2.7. This generally qualifies as Py2/3 compatible, since support for versions <= 2.6 ended quite some time ago.
  • Be explicit about the encoding used whenever opening text files. There might be scenarios where the platform-dependent default encoding makes sense, but usually you want control.
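The following sketch illustrates the `utf-8-sig` behaviour described in the list above. It assumes a temporary directory as the output location and uses made-up variable names. It writes a mixed string with `io.open`, checks the raw bytes for the BOM, and reads the text back with the same codec.

```python
import io
import os
import tempfile

# ASCII, Western European, and Chinese characters in one string.
text = u'Az5 \xe0\xbf\xf6 \u6728\u5170\u8f9e\n'

# Hypothetical output location, chosen only so the example cleans up easily.
path = os.path.join(tempfile.mkdtemp(), 'out.txt')

# 'utf-8-sig' writes the BOM before the first character.
with io.open(path, 'w', encoding='utf-8-sig') as out_file:
    out_file.write(text)

# Read the raw bytes to inspect what actually landed on disk.
with io.open(path, 'rb') as raw_file:
    raw = raw_file.read()

# Reading with the same codec strips the BOM again.
with io.open(path, 'r', encoding='utf-8-sig') as in_file:
    round_trip = in_file.read()

print(raw[:3] == b'\xef\xbb\xbf')
print(round_trip == text)
```

The same calls work on Python 2.7, provided the strings passed to `write()` are Unicode strings (hence the `u''` prefix).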

Side note: You mention a simple heuristic for detecting the input codec. If there's really no way to obtain this information, you should consider using chardet.

Mark Tolonen
lenz
  • Just ran a test and it works perfectly. I can take strings that were decoded to ascii, cp1252, or utf_8, and then `write()` those strings successfully to a file that was opened with `io.open(,'a',encoding='utf_8')` Thanks! – JDM Feb 24 '19 at 03:18
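A sketch of the scenario confirmed in the comment above, under some assumptions: the separator line, the temporary path, and the variable names are invented for illustration. Each sample is decoded with its own codec and then appended to one UTF-8 file. The fixed text uses `u''` literals because, on Python 2, a file opened via `io.open` in text mode accepts only Unicode strings.

```python
import io
import os
import tempfile

# (raw bytes, source encoding) pairs, based on the question's sample data.
samples = [
    (b'\x41\x7a\x35', 'ascii'),
    (b'\xe0\xbf\xf6', 'cp1252'),
    (b'\xe6\x9c\xa8\xe5\x85\xb0\xe8\xbe\x9e', 'utf-8'),
]

# Hypothetical output location.
path = os.path.join(tempfile.mkdtemp(), 'combined.txt')

for raw, encoding in samples:
    # Re-open in append mode for every input, as the app in the question does.
    with io.open(path, 'a', encoding='utf-8') as out_file:
        out_file.write(u'--- next file ---\n')  # fixed text inserted by the app
        out_file.write(raw.decode(encoding) + u'\n')

with io.open(path, 'r', encoding='utf-8') as in_file:
    lines = in_file.read().splitlines()
```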
Be explicit. You've opened for write using a default encoding. Whatever it is, it doesn't support all Unicode code points. Open the file with UTF-8 encoding, which does support all Unicode code points:

import io
# io.open accepts the encoding argument on both Python 2.7 and Python 3
with io.open('out3.txt','w',encoding='utf8') as outFile:
    outFile.write(outString3)
Mark Tolonen
  • Not compatible with Python 2: `TypeError: 'encoding' is an invalid keyword argument for this function` – JDM Feb 23 '19 at 20:27
  • @JDM You tagged your question `python-3.x`. Use `io.open` for Python 2/3 compatibility. It is the same as Python 3's `open` but available in Python 2.7. – Mark Tolonen Feb 24 '19 at 08:06
  • Missing tag added. Sorry, I figured that the paragraph in item "B" was sufficiently clear. – JDM Feb 25 '19 at 14:04