
I am using a batch file to run a Python script with command-line arguments. (Yes, I know Python could do this on its own, but I want to understand why this is not working.) One argument is a file path containing umlauts (ä, ü, ö). If I type the path into the Windows console by hand, everything works fine. If I pass the same path from a batch file (run_script.bat), os.path.exists(filepath) fails. I have read a lot about encoding and decoding but still have no solution.

Example:

I wrote the following code and saved it as a Python module named parsepath.py:

import os
import sys
def main():
    fn = sys.argv[1]
    if os.path.exists(fn):
        print os.path.basename(fn)
        # file exists
    else:
        print 'Could not read path {}'.format(fn)


if __name__ == '__main__':
    print 'starting'
    main()

I created a folder named c:\täterätä. In a Windows console I type c:\python27\python.exe c:\path\to\parsepath.py "c:\täterätä"

This results in

starting
tõterõtõ

The umlaut ä is displayed differently in the console, but I guess that is a separate encoding problem between browser and console. Anyway, this works.

If I put this in the batch file

"C:\python27\python.exe" "C:\path\to\parsepath.py" "c:\täteräta"

and run the batch file it does not work.

starting
Could not read path c:\t+±ter+±ta

It seemed that I had to decode the argument somehow. I used an editor that shows the encoding and saved the batch file as "utf-8".
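
Whether the editor really wrote UTF-8, and whether it prepended a byte order mark (BOM), can be checked from Python by reading the file as raw bytes. The following is a minimal sketch; the helper name `inspect_batch_file` is made up for illustration and is not part of parsepath.py.

```python
import codecs


def inspect_batch_file(path):
    """Return (has_bom, is_utf8) for the file at `path`.

    Hypothetical helper: it only looks at the raw bytes and does not
    try to guess which other encoding a non-UTF-8 file might use.
    """
    with open(path, 'rb') as handle:
        data = handle.read()

    # A UTF-8 BOM is the three bytes EF BB BF at the very start.
    has_bom = data.startswith(codecs.BOM_UTF8)

    # Valid UTF-8 decodes cleanly; a lone Latin-1 byte such as
    # 0xE4 (a-umlaut) does not.
    try:
        data.decode('utf-8')
        is_utf8 = True
    except UnicodeDecodeError:
        is_utf8 = False

    return has_bom, is_utf8
```

Calling `inspect_batch_file('run_script.bat')` returns a pair of booleans. Note that a pure-ASCII file also counts as valid UTF-8, so the check is only informative once the file contains an umlaut.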

I tried to modify parsepath.py with

...
fn = sys.argv[1]
fn = fn.decode('utf-8')
...

but no luck. There is an error message.

Other encodings that work on a pure command line do not work with a batch file either:

fn = fn.decode('mbcs')

In some way the batch file changes the characters of the file path, but I do not know how. I found out that the code page of cmd.exe is 850. This is also what sys.stdout.encoding reports (cp850).
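
The encodings in play can be listed side by side with a short diagnostic. This is only a sketch: `describe_encodings` is a hypothetical helper name, and the values mentioned in the comments are what this question reports or what Python 2 on Windows typically reports, not guaranteed output.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python reports for this process."""
    return {
        # Console output encoding; cp850 in the question.
        'stdout': sys.stdout.encoding,
        # Encoding used for file names; 'mbcs' (the ANSI codepage)
        # on Python 2 under Windows.
        'filesystem': sys.getfilesystemencoding(),
        # Locale default, e.g. cp1252 on a Western European Windows.
        'preferred': locale.getpreferredencoding(),
    }


if __name__ == '__main__':
    for name, value in sorted(describe_encodings().items()):
        print('{0:<10} {1}'.format(name, value))
    # The raw argument bytes, exactly as Python received them.
    print(repr(sys.argv))
```

Running it once from the console and once from the batch file shows whether only `sys.argv` differs between the two cases or whether the reported encodings change as well.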

If I print sys.argv, the file path typed directly on the command line is:

['parsepath.py', 'c:\\t\xe4ter\xe4t\xe4']

When the path comes from the batch file, sys.argv is:

['C:\\path\\to\\parsepath.py','c:\\t+\xf1ter+\xf1t+\xf1']

The difference is obvious: ä is represented as \xe4, but the batch file produces +\xf1.

???

thopy
  • There are two parts to this problem. The more important part for your script is that Python 2 cannot handle this in general on Windows. It uses the CRT's ANSI encoded command-line arguments. To get the correct Unicode command line, install the win_unicode_console package and enable it via `win_unicode_console.enable(use_unicode_argv=True)`. – Eryk Sun Mar 23 '18 at 23:59
  • The second part is related to batch scripts. The CMD shell is a Unicode application (e.g. it reads and writes Unicode to the console via `ReadConsoleW` and `WriteConsoleW`), but it reads from files and pipes as bytes and needs to select an encoding. For this it uses the console's current output codepage. You can set it to UTF-8 in a batch script via `chcp.com 65001 > nul`, but first save the current codepage to restore it. Also, the batch script needs to be saved as UTF-8 without a byte order mark (BOM). – Eryk Sun Mar 24 '18 at 00:07
  • Thanks for these hints. Actually the second suggestion did the trick. I changed codepage with chcp 65001 just before running "C:\python27\python.exe" "C:\path\to\parsepath.py" "c:\täteräta" inside the batch script and os.path.exists(sys.argv[1]) inside parsepath responded TRUE. – thopy Mar 27 '18 at 14:50
  • Switching the console to codepage 65001 (UTF-8) should only be done temporarily to read from a batch script or pipe. CMD doesn't use the codepage to read from and write to the console itself, but other programs do (e.g. Python 2 without win_unicode_console). The console's handling of 65001 is buggy. It limits `ReadConsoleA` and `ReadFile` to 7-bit ASCII in all versions of Windows, and `WriteConsoleA` and `WriteFile` return the wrong number of bytes written for non-ASCII output prior to Windows 8, which causes many programs (e.g. Python) to output garbage at the end of every print. – Eryk Sun Mar 27 '18 at 15:07
  • Also, that Python 2 happens to work with `"ä"` without enabling win_unicode_console's Unicode `sys.argv` is just a fortunate coincidence. Your ANSI codepage supports this character. But the codepage covers only about 250 characters; the thousands of other Unicode characters will not be supported. – Eryk Sun Mar 27 '18 at 15:11
  • Yes, I am aware of the fact that chcp can make things worse. I read this discussion: [link](https://superuser.com/questions/269818/change-default-code-page-of-windows-console-to-utf-8). (OMG!) – thopy Mar 28 '18 at 06:56
  • Good to know that my working example may be a coincidence. Actually, using the run script of win_unicode_console to enable it did not change anything at first. After that I tried changing the codepage, and that worked, but maybe I should test some Chinese characters or the like. I guess they will only work with your suggested combination of win_unicode_console and chcp. – thopy Mar 28 '18 at 07:04
  • `'täterätä'.encode('latin1').decode('cp850')` returns `'tõterõtõ'` so apparently your `bat` file is saved in `latin1` encoding and interpreted in `cp850` one. Try `chcp 1252` _before_ calling the `bat`. – JosefZ Dec 28 '20 at 20:33
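
The byte path described in the comments can be replayed in Python to see where `+\xf1` comes from. This is a sketch under stated assumptions: the batch file is saved as UTF-8, CMD decodes it with console codepage 850, and the ANSI codepage is 1252. The one-entry `BEST_FIT` table stands in for Windows' best-fit conversion, which Python's `cp1252` codec does not perform; the mapping of the box-drawing character to `+` is inferred from the `sys.argv` output in the question, not taken from documentation. `replay_batch_argument` is a hypothetical name.

```python
# Assumption: best-fit conversion to cp1252 turns U+251C (a box-drawing
# character) into '+', as the sys.argv output in the question suggests.
BEST_FIT = {u'\u251c': u'+'}


def replay_batch_argument(path, file_encoding='utf-8',
                          console_cp='cp850', ansi_cp='cp1252'):
    """Replay how a path in a batch file could reach Python 2's sys.argv.

    Returns (text_as_cmd_reads_it, ansi_bytes_in_argv, text_on_console).
    """
    # 1. The editor stores the batch file as bytes.
    raw = path.encode(file_encoding)

    # 2. CMD reads those bytes using the console codepage.
    as_read_by_cmd = raw.decode(console_cp)

    # 3. Python 2 receives ANSI-encoded arguments; characters missing
    #    from the ANSI codepage are best-fit mapped or become '?'.
    fitted = u''.join(BEST_FIT.get(ch, ch) for ch in as_read_by_cmd)
    ansi_bytes = fitted.encode(ansi_cp, 'replace')

    # 4. Printing those bytes shows them in the console codepage again.
    on_console = ansi_bytes.decode(console_cp)

    return as_read_by_cmd, ansi_bytes, on_console
```

With the path from the batch file, `replay_batch_argument(u'c:\\t\xe4ter\xe4ta')` yields the bytes `c:\t+\xf1ter+\xf1ta` for `sys.argv` and the console text `c:\t+±ter+±ta`, which matches the "Could not read path" line in the question. Under the same assumptions, this would explain why `fn.decode('utf-8')` and `fn.decode('mbcs')` cannot repair the path inside parsepath.py: the original bytes are already lost before Python starts.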
