
I'm trying to use the pathlib module, part of the standard library since Python 3.4, to find and manipulate file paths. While treating paths as objects is an improvement over the os.path-style functions, I'm having trouble with some more exotic filenames on POSIX filesystems; specifically, files whose names contain bytes that cannot be decoded as UTF-8:

>>> pathlib.PosixPath(b'\xe9')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.5/pathlib.py", line 969, in __new__
    self = cls._from_parts(args, init=False)
  File "/usr/lib/python3.5/pathlib.py", line 651, in _from_parts
    drv, root, parts = self._parse_args(args)
  File "/usr/lib/python3.5/pathlib.py", line 643, in _parse_args
    % type(a))
TypeError: argument should be a path or str object, not <class 'bytes'>

>>> b'\xe9'.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 0: unexpected end of data

The problem is that such files can exist on a POSIX filesystem, and I'd like my application to handle any filesystem-valid filename rather than raise errors or behave unpredictably.

I can get a PosixPath object for such a file via the .iterdir() method of its parent directory, but I have yet to find a way to construct one from a full path supplied as a value of type 'bytes', which is rather hard to avoid when loading paths from another source that fully supports all filesystem-valid raw byte values (such as a database or a file containing NUL-separated paths).
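
For instance, here is a minimal sketch of the situation (paths.dat is a made-up file of NUL-separated raw byte paths):

import pathlib

# Listing the parent directory works even for undecodable names:
for p in pathlib.PosixPath('.').iterdir():
    print(repr(p))

# But raw byte paths read from an external source cannot be passed in directly.
with open('paths.dat', 'rb') as f:
    for raw in filter(None, f.read().split(b'\x00')):
        pathlib.PosixPath(raw)  # TypeError: argument should be a path or str object, not <class 'bytes'>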

Is there a way to do this that I'm not aware of? Or, if it's really not possible: is this by design, or could it be considered a deficiency in the standard library that might warrant a bug report?

I did find a related bug report, but that issue concerned documentation incorrectly mentioning that arguments of class 'bytes' were allowed.


1 Answer


I think you can get what you want like this:

import os
from pathlib import PosixPath
PosixPath(os.fsdecode(b'\xe9'))

Demo:

>>> import os, pathlib
>>> b = b'\xe9'
>>> p = pathlib.Path(os.fsdecode(b))
>>> p.exists()
False
>>> with open(b, mode='w') as f:
...     f.write('wacky filename')
...     
>>> p.exists()
True
>>> p.read_bytes()
b'wacky filename'
>>> os.listdir(b'.')
[b'\xe9']
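
If the original bytes are needed again later (e.g. to write back to the database), the conversion is lossless. A small sketch, assuming the usual POSIX setup where os.fsdecode() / os.fsencode() use the filesystem encoding with the 'surrogateescape' error handler:

import os
import pathlib

b = b'\xe9'
p = pathlib.PosixPath(os.fsdecode(b))  # undecodable bytes become lone surrogate code points
assert os.fsencode(str(p)) == b        # encoding the str form recovers the original raw bytes
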
    Thank you. Although I had seen 'surrogateencoding' mentioned in Python 3 discussions in passing, I was unaware that it allows any raw byte value to be represented as Unicode code points, and that that mechanism provides a workable solution to the problem I encountered. I do still feel that being able to use real 'bytes' would be a semantically more meaningful approach (the raw data isn't *really* Unicode after all). On the plus side, knowing about this trick will probably allow me to solve similar problemtic situations and help reduce the need for 'raw bytes' type persistent storage. – Arjen Meek Aug 17 '17 at 00:35
  • In this case, it is actually POSIX that I disagree with. We can have filenames with different encodings just sitting there in the same directory, it's a mess. Maybe one of the few things that Windows has done better than Linux these days. – wim Aug 17 '17 at 00:44
  • I do agree, having filenames as raw bytes is a rather dated approach (explained by the long history behind POSIX, I guess) now that the entire online world has pretty much settled on Unicode / UTF-8. It does sound as though Python's use of lone surrogates for this might actually provide an avenue for Linux/UNIX systems to migrate to Unicode while maintaining backward compatibility. Then again, people are probably working on that already... – Arjen Meek Aug 17 '17 at 00:53
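
To illustrate what the first comment describes: the 'surrogateescape' error handler maps each undecodable byte to a lone surrogate code point and maps it back on encoding, so arbitrary raw bytes survive the round trip. A quick REPL sketch:

>>> b'\xe9'.decode('utf-8', errors='surrogateescape')
'\udce9'
>>> '\udce9'.encode('utf-8', errors='surrogateescape')
b'\xe9'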