
I am trying to create a duplicate file finder for Windows. My program works well in Linux. But it writes NUL characters to the log file in Windows. This is due to the MBCS default file system encoding of Windows, while the file system encoding in Linux is UTF-8. How can I convert MBCS to UTF-8 to avoid this error?

Eryk Sun
  • Windows file APIs are natively UTF-16 (actually UCS-2), so you should be using Unicode, e.g. `os.listdir(u'.')`. Then for writing to the log file, use `io.open` with `encoding='utf-8'`. Like this, no text should ever be `'mbcs'` (ANSI) encoded. – Eryk Sun Jan 02 '17 at 07:21
  • Actually, os.scandir() in Python 3.5 (not available up to 3.4) returns the MBCS equivalent of DirEntry.name, which I need to be UTF-8 – achint chaudhary Jan 02 '17 at 07:34
  • No, `os.scandir` will return Unicode if you pass the path as a Unicode string. `bytes` paths aren't even allowed by `os.scandir` in Windows Python 3.5. If you upgrade to 3.6 you can go back to using bytes paths because we've overhauled the guts of the os module to use UTF-8 as the filesystem encoding for bytes paths on Windows. Internally 3.6 handles transcoding between UTF-16 (i.e. the Windows Unicode API) and UTF-8 for `bytes` consumers. – Eryk Sun Jan 02 '17 at 08:12
  • Actually, os.scandir() returns DirEntry objects, and the name attribute (accessed as i.name) is returning an MBCS string, which, when written to a UTF-8 encoded file, causes NUL characters to be written – achint chaudhary Jan 02 '17 at 08:17
  • No, this is the last time I'm going to tell you that you're mistaken. The `DirEntry` `name` and `path` attributes are `str` Unicode strings. It's calling the Unicode (wide-character) file APIs (e.g. `FindFirstFileW`). No ANSI APIs are called. `bytes` paths are strictly forbidden with `os.scandir` in Windows Python 3.5. – Eryk Sun Jan 02 '17 at 08:26
  • Thanks, I got it. Sorry about last time; I was a bit frustrated then – achint chaudhary Jan 03 '17 at 11:22

2 Answers


Tell Python to use UTF-8 on the log file. In Python 3 you do this by:

open(..., encoding='utf-8')
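As a minimal sketch of how this might fit into a duplicate finder's logging, the example below lists a directory with a str path (so the names come back as str) and writes them to a log opened with an explicit UTF-8 encoding. The helper names `list_names` and `write_log` and the file name `duplicates.log` are made up for illustration and are not part of the original program.

```python
import os
import tempfile


def list_names(directory):
    """Return the sorted entry names in `directory` as str objects."""
    # A str path makes os.scandir yield str (Unicode) names.
    return sorted(entry.name for entry in os.scandir(directory))


def write_log(names, log_path):
    """Write one name per line to `log_path`, encoded as UTF-8."""
    # An explicit encoding avoids the locale-dependent default of open().
    with open(log_path, "w", encoding="utf-8", newline="\n") as log:
        for name in names:
            log.write(name + "\n")


if __name__ == "__main__":
    # Demo in a throwaway directory (hypothetical file names).
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("a.txt", "b.txt"):
            open(os.path.join(tmp, name), "w").close()
        log_path = os.path.join(tmp, "duplicates.log")
        write_log(list_names(tmp), log_path)
        with open(log_path, encoding="utf-8") as log:
            print(log.read())
```

Because the names stay str from `os.scandir` through to `log.write`, no manual transcoding step is involved; the only encoding decision is the one made when opening the log.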

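In Python 3 a str filename is already Unicode, so it needs no conversion before being written to a UTF-8 file. If you have a bytes name in the MBCS (ANSI) encoding, decode it to str and then encode that as UTF-8:

raw_name.decode('mbcs').encode('utf-8')

Use raw_name.decode(sys.getfilesystemencoding())... to make the code work on Linux as well, since the 'mbcs' codec exists only on Windows.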
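If a name really does arrive as bytes in a legacy encoding, converting it to UTF-8 means decoding to str first and encoding second. The sketch below assumes a hypothetical helper `to_utf8`; it falls back to `sys.getfilesystemencoding()` when no source encoding is given, and the usage lines use `cp1252` only as a stand-in for a Windows ANSI code page, because the `'mbcs'` codec is not available outside Windows.

```python
import sys


def to_utf8(raw_name, source_encoding=None):
    """Re-encode a bytes name from `source_encoding` to UTF-8 bytes."""
    if source_encoding is None:
        # Portable default: whatever the OS uses for bytes filenames.
        source_encoding = sys.getfilesystemencoding()
    text = raw_name.decode(source_encoding)  # bytes -> str (Unicode)
    return text.encode("utf-8")              # str -> UTF-8 bytes


# 'é' is the single byte 0xE9 in cp1252 and two bytes in UTF-8.
legacy = "\u00e9".encode("cp1252")
converted = to_utf8(legacy, "cp1252")
print(legacy, converted)
```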

zmbq

Just change the encoding to 'latin-1' (encoding='latin-1')

Using pure Python: open(..., encoding='latin-1')

Using Pandas: pd.read_csv(..., encoding='latin-1')
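To illustrate what this option does: Latin-1 assigns a character to every byte value from 0 to 255, so decoding with it never raises an error, but bytes that were written in another encoding come out as different characters. A small sketch with made-up sample text:

```python
# UTF-8 bytes for a word containing one non-ASCII character.
data = "caf\u00e9".encode("utf-8")

# Latin-1 decodes byte by byte, so the two-byte UTF-8 sequence
# for 'é' becomes two separate characters.
as_latin1 = data.decode("latin-1")

# Decoding with the encoding the bytes were written in restores the text.
as_utf8 = data.decode("utf-8")

print(repr(as_latin1), repr(as_utf8))
```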

Leandro Lima