UnicodeEncodeError for UTF-8 file names returned from a tk dialog

Question

Hi, I'm trying to extract audio from a video file using ffmpeg with the following function (in Python 2):

def extractAudio(path):
    command = ''.join(('ffmpeg -i "',path,'" -ab 160k -ac 2 -ar 44100 -vn audio.wav'))
    print(command)
    subprocess.call(command,shell=True)

The print statement above successfully prints the following:

ffmpeg -i "C:/Users/pruthvi/Desktop/vidrec/temp\TAEYEON 태연_ I (feat. Verbal Jint)_Music Video.mp4" -ab 160k -ac 2 -ar 44100 -vn audio.wav

But the next statement fails and throws the following error:

Traceback (most recent call last):
  File "C:/Users/pruthvi/Desktop/vidrec/vidrec.py", line 53, in <module>
    main()
  File "C:/Users/pruthvi/Desktop/vidrec/vidrec.py", line 46, in main
    extractAudio(os.path.join(di,each))
  File "C:/Users/pruthvi/Desktop/vidrec/vidrec.py", line 28, in extractAudio
    subprocess.call(command,shell=True)
  File "C:\Python27\lib\subprocess.py", line 522, in call
    return Popen(*popenargs, **kwargs).wait()
  File "C:\Python27\lib\subprocess.py", line 710, in __init__
    errread, errwrite)
  File "C:\Python27\lib\subprocess.py", line 928, in _execute_child
    args = '{} /c "{}"'.format (comspec, args)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 56-57: ordinal not in range(128)

I've tried the solutions suggested in previous questions, such as encoding with the correct codec and setting PYTHONIOENCODING, but none of them works. If I convert the command to ASCII, the non-ASCII characters are removed, so the result is a file-not-found error and no audio is extracted from the target file. Any help is appreciated, thanks :)

To experiment, you can use the following code:

# -*- coding: utf-8 -*-
import subprocess

def extractAudio():
    path = u'C:/Users/pruthvi/Desktop/vidrec/temp\TAEYEON 태연_ I (feat. Verbal Jint)_Music Video.mp4'
    command = ''.join(('ffmpeg -i "',path,'" -ab 160k -ac 2 -ar 44100 -vn audio.wav'))
    print(command)
    subprocess.call(command,shell=True)
extractAudio()

As a temporary solution, I copied the file to a non-Unicode name and then passed that to `subprocess`. I'd still like to know how the above problem can be solved. — Pruthvi Raj, Jan 17 '16 at 15:35
What happens if you get rid of `shell=True` and do `subprocess.call(['ffmpeg', '-i', path, '-ab', '160k', '-ac', '2', '-ar', '44100', '-vn', 'audio.wav'])`? — bbayles, Jan 17 '16 at 15:40
I notice the `path` has a backslash as well as a forward slash. I don't think that would cause the issue, but have you tried making it consistently forward slashes? — bbayles, Jan 17 '16 at 15:48
@bbayles I used `os.path.join(directory,file)`, it resulted so — Pruthvi Raj, Jan 17 '16 at 16:21
Hi, you should try to describe what you're actually doing in the title (i.e. passing arbitrary Unicode characters in arguments to a subprocess). This makes it easier for people with a similar problem to find this question. — roeland, Jan 18 '16 at 01:27
@roeland: Even at your reputation level, you could submit an edit suggestion. — tripleee, Jan 18 '16 at 04:43

Martin Konecny · Answer 1 · 2016-01-17T16:42:02.047

Because you are passing a unicode string to subprocess.call, Python tries to encode it to an encoding it thinks the filesystem/OS will understand. For some reason it has chosen ASCII, which is wrong.

You can try using the correct encoding via

subprocess.call(command.encode(sys.getfilesystemencoding()))

edited Jan 17 '16 at 16:42

answered Jan 17 '16 at 15:59

Perhaps also explain why and how to lose the `shell=True` abomination. – tripleee Jan 17 '16 at 16:16
This throws `TypeError: must be string without null bytes or None, not str ` – Pruthvi Raj Jan 17 '16 at 16:20
What happens when you encode to utf-8 instead? Also what's the output of `sys.getfilesystemencoding()`? – Martin Konecny Jan 17 '16 at 16:32
The OP is using Python 2.7, so he is passing in a byte string. – roeland Jan 18 '16 at 00:34

bobince · Answer 2 · 2016-01-18T10:55:09.187

Same as in your previous question: most cross-platform software on Windows can't handle non-ASCII characters in filenames.

Python's subprocess module uses interfaces based on byte strings. Under Windows, the command line is based on Unicode strings (technically UTF-16 code units), so the MS C runtime converts the byte strings into Unicode strings using an encoding (the ‘ANSI’ code page) that varies from machine to machine and which can never include all Unicode characters.

If your Windows installation were a Korean one, your ANSI code page would be 949 Korean and you would be able to write the command by saying one of:

subprocess.call(command.encode('cp949'))
subprocess.call(command.encode('mbcs'))

(where mbcs is short for ‘multi-byte character set’, which is a synonym for the ANSI code page on Windows.) If your installation isn't Korean you'll have a different ANSI code page and you will be unable to write that filename into a command as your command line encoding won't have any Hangul in it. The ANSI encoding is never anything sensible like UTF-8 so no-one can reliably use subprocess to execute commands with all Unicode characters in.
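The limitation described above can be checked by testing whether a filename survives encoding into a given code page. This is only an illustrative sketch: `fits_code_page` is a hypothetical helper, the filename is a shortened version of the one in the question, and the code pages are named explicitly (`cp949`, `cp1252`) because the `mbcs` codec exists only on Windows.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def fits_code_page(text, code_page):
    """Return True if every character of `text` exists in `code_page`.

    Hypothetical helper for illustration, not part of any library.
    """
    try:
        text.encode(code_page)  # strict mode raises on unmappable characters
        return True
    except UnicodeEncodeError:
        return False


name = u'TAEYEON 태연_ I.mp4'  # shortened version of the question's filename

print(fits_code_page(name, 'cp949'))   # Korean ANSI code page
print(fits_code_page(name, 'cp1252'))  # Western European ANSI code page
```

The Hangul characters exist in code page 949 but not in 1252, so the name can only be encoded on a machine whose ANSI code page is the Korean one.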

As discussed in the previous question, Python includes workarounds for Unicode filenames to use the native Win32 API instead of the C standard library. In Python 3 it also uses Win32 Unicode APIs for creating processes, but this is not the case back in Python 2. You could potentially hack something up yourself by calling the Win32 CreateProcessW function through ctypes, which gives you direct access to Windows APIs, but it's a bit of a pain.

...and it would be of no use anyway, because even if you did get non-ANSI characters into the command line, the ffmpeg command would itself fail. This is because ffmpeg is also a cross-platform application that uses the C standard libraries to read command lines and files. It would fail to read the Korean in the command line argument, and even if you got it through somehow it would fail to read the file of that name!

This is a source of ongoing frustration on the Windows platform: although it supports Unicode very well internally, most tools that run over the top of it can't. The answer should have been for Windows to support UTF-8 in all the byte-string interfaces it implements, instead of the sad old legacy ANSI code pages that no-one wants. Unfortunately Microsoft have repeatedly declined to take even the first steps towards making UTF-8 a first-class citizen on Windows (namely fixing some of the bugs that stop UTF-8 working in the console). Sorry.

Unrelated: this:

''.join(('ffmpeg -i "',path,'"...

is generally a bad idea. There are a number of special characters in filenames that would break that command line and possibly end up executing all kinds of other commands. If the input paths were controlled by someone untrusted that would be a severe security hole. In general when you put a command line together from variables you need to apply escaping to make the string safe for inclusion, and the escaping rules on Windows are complex and annoying.
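One way to sidestep hand-written escaping is to pass the arguments as a list and drop `shell=True`, as bbayles suggested in the comments. The sketch below assumes a hypothetical helper `build_ffmpeg_args` and a made-up filename; the flags are the ones from the question. It addresses only the quoting problem, not the Python 2 Unicode problem.

```python
from __future__ import print_function

import subprocess


def build_ffmpeg_args(path):
    """Build the ffmpeg call as a list, one element per argument.

    Hypothetical helper; the flags are copied from the question.
    """
    return ['ffmpeg', '-i', path,
            '-ab', '160k', '-ac', '2', '-ar', '44100', '-vn', 'audio.wav']


# Made-up filename with a space and a cmd.exe metacharacter.
args = build_ffmpeg_args('my video & notes.mp4')

# list2cmdline shows the command line string that subprocess builds from
# a list on Windows, quoted according to the MS C runtime rules.
print(subprocess.list2cmdline(args))

# To actually run it (without shell=True, cmd.exe never parses the string):
# subprocess.call(args)
```

With a list, subprocess does the quoting for you, and because no shell is involved, characters such as `&` are not treated as command separators.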

You can avoid both the escaping problem and the Unicode problem by keeping everything inside Python. Instead of launching a command to invoke the ffmpeg code, you could use a module that brings the functionality of ffmpeg into Python, such as PyFFmpeg.

Or a cheap 'n' cheerful 'n' crappy workaround would be to copy/move the file to a known-safe name in Python, run the ffmpeg command using the static filename, and then rename/copy the file back...
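A minimal sketch of that copy-based workaround might look like the following. The names `run_with_safe_name` and `ffmpeg_args` are hypothetical, and the sketch assumes that the temporary directory path and the file extension are themselves ASCII-only. Here ffmpeg writes `audio.wav` to the current directory, so only the input copy needs cleaning up.

```python
# -*- coding: utf-8 -*-
import os
import shutil
import subprocess
import tempfile


def run_with_safe_name(path, build_args, runner=subprocess.call):
    """Copy `path` to an ASCII-only temporary name, run a command on the
    copy, then delete the copy.

    Hypothetical helper: `build_args` maps the safe path to an argument
    list, and `runner` executes that list (subprocess.call by default).
    """
    work_dir = tempfile.mkdtemp(prefix='vidrec_')
    try:
        extension = os.path.splitext(path)[1]
        safe_path = os.path.join(work_dir, 'input' + extension)
        # Python's own file functions accept Unicode paths, so the copy works
        # even when the name cannot be put on a command line.
        shutil.copy2(path, safe_path)
        return runner(build_args(safe_path))
    finally:
        shutil.rmtree(work_dir, ignore_errors=True)


def ffmpeg_args(safe_path):
    # Flags are copied from the question.
    return ['ffmpeg', '-i', safe_path,
            '-ab', '160k', '-ac', '2', '-ar', '44100', '-vn', 'audio.wav']

# Example call (needs ffmpeg on PATH; the filename is a placeholder):
# run_with_safe_name(u'some 태연 video.mp4', ffmpeg_args)
```

Copying a large video costs time and disk space; renaming the file in place and renaming it back afterwards avoids the copy, at the risk of leaving the file misnamed if the script is interrupted.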

I don't think this explanation is right, the docs actually mention that “On Windows, the class uses the Windows CreateProcess()”. It is just Python 2.x internally using byte strings to handle the command line. — roeland, Jan 18 '16 at 00:33
Link → https://docs.python.org/2.7/library/subprocess.html — roeland, Jan 18 '16 at 00:33
Thanks, updated—it is indeed fixed in Python 3, although upgrading still wouldn't help because of the other reasons listed here. — bobince, Jan 18 '16 at 10:55

roeland · Answer 3 · 2016-01-18T04:31:17.167

You have two problems:

Python 2

The subprocess module breaks when any of the arguments contains non-ASCII Unicode characters. This issue is fixed in Python 3: you can pass any Unicode file names and arguments to subprocess, and it will properly forward them to the child process.
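The Python 3 behaviour can be checked with a small round trip. This sketch is Python 3 only and uses a second Python interpreter as a stand-in for any child program that reads its command line as Unicode; `echo_argument` is a hypothetical helper and the filename is a shortened version of the one in the question.

```python
# Python 3 only.
import subprocess
import sys


def echo_argument(argument):
    """Start a child Python process and return how it saw `argument`.

    The child prints ascii(sys.argv[1]), so the reply is plain ASCII and
    does not depend on the console encoding.
    """
    child_code = 'import sys; print(ascii(sys.argv[1]))'
    output = subprocess.check_output(
        [sys.executable, '-c', child_code, argument])
    return output.decode('ascii').strip()


name = 'TAEYEON 태연_ I.mp4'

# Compares what the child received with what the parent sent.
print(echo_argument(name) == ascii(name))
```

If the comparison holds, the argument reached the child intact. That says nothing about whether the child program can then open a file of that name, which is the second problem below.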

ffmpeg

ffmpeg itself cannot open these files, something that you can easily verify by just trying to run it from the command line:

C:\temp>fancy αβγ.m4v
... lots of other output
fancy a�?.m4v: Invalid data found when processing input

(my code page is windows-1252, note how the Greek α got replaced with a Latin a)

You cannot fix this problem, but you can work around it; see bobince's answer.

edited Jan 18 '16 at 04:31

answered Jan 18 '16 at 01:25

`os.listdir(u'.')` returns Unicode filenames correctly in Windows on Python 2. – Mark Tolonen Jan 18 '16 at 04:05
@MarkTolonen true, I'll edit that out – roeland Jan 18 '16 at 04:31
