I'm working on a Python 3 Tkinter app (OS is Windows 10) whose overall functionality includes the following details:
1. Reading a number of text files which may contain data in ascii, cp1252, utf-8, or any other encoding.
2. Showing the contents of any of those files in a "preview window" (Tkinter Label widget).
3. Writing the file contents to a single output file (opening to append each time).
For #1: I've made the file read encoding-agnostic simply by opening and reading the files in binary mode. To convert the data to a string I use a loop which runs through a list of 'likely' encodings and tries each of them in turn (with errors='strict') until it hits one that doesn't throw an exception. This is working.
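The decode loop described in #1 can be sketched roughly as follows. The function name `decode_with_fallback` and the particular list of encodings are assumptions for illustration only, not the actual application code. Order matters in such a list: cp1252 accepts almost any byte sequence, so it goes last.

```python
def decode_with_fallback(data, encodings=('ascii', 'utf-8-sig', 'cp1252')):
    """Return (text, encoding) for the first candidate that decodes strictly.

    The candidate list is hypothetical; 'utf-8-sig' is used here so that a
    leading UTF-8 byte order mark is dropped instead of kept in the string.
    """
    for enc in encodings:
        try:
            return data.decode(enc, 'strict'), enc
        except UnicodeDecodeError:
            # This encoding cannot represent the bytes; try the next one.
            continue
    raise ValueError('none of the candidate encodings could decode the data')


text1, enc1 = decode_with_fallback(b'\x41\x7a\x35')
text2, enc2 = decode_with_fallback(b'\xe0\xbf\xf6')
text3, enc3 = decode_with_fallback(b'\xef\xbb\xbf\xe6\x9c\xa8\xe5\x85\xb0\xe8\xbe\x9e')
```

Returning the encoding alongside the text is a design choice for the sketch; it makes it easy to log which candidate matched for each file.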
For #2: Once I've got the decoded string I just call the set() method for the Tkinter Label's textvariable. This is also working.
For #3: I'm opening an output file in the usual way and calling the write() method to write the decoded string. This works when the string was decoded as ascii or cp1252, but when it's decoded as utf-8 it throws an exception:
'charmap' codec can't encode characters in position 0-3: character maps to <undefined>
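The same message can be reproduced without any file I/O, which suggests where it comes from. This is a sketch under an assumption: that the output file is being encoded with cp1252, which is a common locale default on Windows when open() is called in text mode without an encoding argument. The variable names are illustrative.

```python
# Decoding with plain 'utf_8' keeps the byte order mark as U+FEFF, so the
# resulting string has four characters: the BOM plus three CJK characters.
text = b'\xef\xbb\xbf\xe6\x9c\xa8\xe5\x85\xb0\xe8\xbe\x9e'.decode('utf_8')

try:
    # Assumption: cp1252 stands in for the locale's default file encoding.
    text.encode('cp1252')
    error_message = None
except UnicodeEncodeError as e:
    error_message = str(e)

# Encoding the same string as UTF-8 succeeds and restores the original bytes.
utf8_bytes = text.encode('utf-8')
```

If the assumption holds, the failure is in the encode step on write, not in the earlier decode: none of those four characters exist in cp1252.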
I've searched around and found fairly similar questions but nothing that seems to address this specific problem. Some further complications that restrict the solutions that will work for me:
A. I can sidestep the problem just by leaving the read-in data as bytes and opening/writing the output file as binary, but this renders some of the input file contents unreadable.
B. Although this app is mainly intended for Python 3, I'm trying to make it cross-compatible with Python 2 -- we have some slow/late adopters who will be using it. (BTW, when I run the app on Python 2 it also throws exceptions but does so for both the cp1252 data and the utf-8 data.)
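Given constraints A and B, one direction worth testing is to keep writing text (not bytes) but name the output encoding explicitly through io.open, which exists in Python 2.6+ and is the same function as the built-in open() in Python 3. The sketch below assumes UTF-8 is acceptable for the output file; the helper name `append_text` and the temporary path are hypothetical.

```python
import io
import os
import tempfile


def append_text(path, text, encoding='utf-8'):
    """Append already-decoded text to path using an explicit encoding."""
    # In Python 2, io.open in text mode expects unicode objects, which is
    # what bytes.decode() returns there.
    with io.open(path, 'a', encoding=encoding) as out_file:
        out_file.write(text)


# Illustrative usage against a temporary file.
out_path = os.path.join(tempfile.mkdtemp(), 'out.txt')
append_text(out_path, b'\xe0\xbf\xf6'.decode('cp1252'))
append_text(out_path, b'\xe6\x9c\xa8'.decode('utf-8'))

with io.open(out_path, 'r', encoding='utf-8') as in_file:
    round_trip = in_file.read()
```

Because every input is decoded to text first and then re-encoded once on output, the combined file ends up in a single encoding regardless of what each source file used.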
For the sake of illustrating the issue, I've created this stripped-down test script. (My real application is a much larger project, and it's also proprietary to my company -- so it's not getting posted publicly!)
import tkinter as tk
import codecs
#Root window
root = tk.Tk()
#Widgets
ctrlViewFile1 = tk.StringVar()
ctrlViewFile2 = tk.StringVar()
ctrlViewFile3 = tk.StringVar()
lblViewFile1 = tk.Label(root, relief=tk.SUNKEN,
                        justify=tk.LEFT, anchor=tk.NW,
                        width=10, height=3,
                        textvariable=ctrlViewFile1)
lblViewFile2 = tk.Label(root, relief=tk.SUNKEN,
                        justify=tk.LEFT, anchor=tk.NW,
                        width=10, height=3,
                        textvariable=ctrlViewFile2)
lblViewFile3 = tk.Label(root, relief=tk.SUNKEN,
                        justify=tk.LEFT, anchor=tk.NW,
                        width=10, height=3,
                        textvariable=ctrlViewFile3)
#Layout
lblViewFile1.grid(row=0,column=0,padx=5,pady=5)
lblViewFile2.grid(row=1,column=0,padx=5,pady=5)
lblViewFile3.grid(row=2,column=0,padx=5,pady=5)
#Bytes read from "files" (ascii Az5, cp1252 European letters/punctuation, utf-8 Mandarin characters)
inBytes1 = b'\x41\x7a\x35'
inBytes2 = b'\xe0\xbf\xf6'
inBytes3 = b'\xef\xbb\xbf\xe6\x9c\xa8\xe5\x85\xb0\xe8\xbe\x9e'
#Decode
outString1 = codecs.decode(inBytes1,'ascii','strict')
outString2 = codecs.decode(inBytes2,'cp1252','strict')
outString3 = codecs.decode(inBytes3,'utf_8','strict')
#Assign stringvars
ctrlViewFile1.set(outString1)
ctrlViewFile2.set(outString2)
ctrlViewFile3.set(outString3)
#Write output files
try:
    with open('out1.txt','w') as outFile:
        outFile.write(outString1)
except Exception as e:
    print(inBytes1)
    print(str(e))
try:
    with open('out2.txt','w') as outFile:
        outFile.write(outString2)
except Exception as e:
    print(inBytes2)
    print(str(e))
try:
    with open('out3.txt','w') as outFile:
        outFile.write(outString3)
except Exception as e:
    print(inBytes3)
    print(str(e))
#Start GUI
tk.mainloop()