UnicodeEncodeError: 'utf-8' codec can't encode character

Question

When I run the command below from Ansible Tower, I hit a UnicodeEncodeError from the 'utf-8' codec.

Command:

pip3 install -r /requirements.txt --no-index --find-links /

Error as below:

"stderr_lines": 
        "WARNING: Running pip install with root privileges is generally not a good idea. Try `pip3 install --user` instead.",
        "Exception:",
        "Traceback (most recent call last):",
        "  File \"/usr/lib/python3.6/site-packages/pip/basecommand.py\", line 215, in main",
        "    status = self.run(options, args)",
        "  File \"/usr/lib/python3.6/site-packages/pip/commands/install.py\", line 346, in run",
        "    requirement_set.prepare_files(finder)",
        "  File \"/usr/lib/python3.6/site-packages/pip/req/req_set.py\", line 381, in prepare_files",
        "    ignore_dependencies=self.ignore_dependencies))",
        "  File \"/usr/lib/python3.6/site-packages/pip/req/req_set.py\", line 557, in _prepare_file",
        "    require_hashes",
        "  File \"/usr/lib/python3.6/site-packages/pip/req/req_install.py\", line 278, in populate_link",
        "    self.link = finder.find_requirement(self, upgrade)",
        "  File \"/usr/lib/python3.6/site-packages/pip/index.py\", line 465, in find_requirement",
        "    all_candidates = self.find_all_candidates(req.name)",
        "  File \"/usr/lib/python3.6/site-packages/pip/index.py\", line 386, in find_all_candidates",
        "    self.find_links, expand_dir=True)",
        "  File \"/usr/lib/python3.6/site-packages/pip/index.py\", line 236, in _sort_locations",
        "    sort_path(os.path.join(path, item))",
        "  File \"/usr/lib/python3.6/site-packages/pip/index.py\", line 216, in sort_path",
        "    url = path_to_url(path)",
        "  File \"/usr/lib/python3.6/site-packages/pip/download.py\", line 466, in path_to_url",
        "    url = urllib_parse.urljoin('file:', urllib_request.pathname2url(path))",
        "  File \"/usr/lib64/python3.6/urllib/request.py\", line 1689, in pathname2url",
        "    return quote(pathname)",
        "  File \"/usr/lib64/python3.6/urllib/parse.py\", line 891, in quote",
        "    string = string.encode(encoding, errors)",
        "UnicodeEncodeError: 'utf-8' codec can't encode character '\\udcfd' in position 1: surrogates not allowed"

Does your requirements.txt contain some weird characters or is it saved in some exotic encoding…? — deceze, Jul 13 '23 at 05:07

score 0 · Answer 1 · answered Jul 13 '23 at 08:05

You have hit a corner case of the Python 3 language. Note: this corner case attracts more opinions than real-world hits.

Python uses Unicode internally for strings. (Forget about the actual in-memory encoding; it even changed during the 3.x series, so the internal representation doesn't matter here.)

The problem: sometimes binary data or arbitrary bytes turn up in a place where Python expects a string (that is, Unicode text). This happens in sys.argv, in environment variables, and in filenames. From the stack trace in your question, it seems that a filename triggered this error.

So the question is how to reconcile "text is Unicode" with the rare cases where the data cannot be represented in Unicode, without losing data (e.g. if you need to pass it on to other processes, or want to do your own correction).

Python decided to use surrogate code points. (Wrongly, I think, because Unicode describes other ways, such as using code points outside the range of allowed Unicode characters, which is the way Unicode recommends for internal data.)

Surrogates are 16-bit code units that UTF-16 uses to encode code points that need more than 16 bits: they come in pairs, and the bits of both surrogates together cover all the remaining Unicode code points (about 20 bits). Because a decoder should never output such code points (it should merge the pair and give you the code point above the 16-bit limit), no valid Unicode string should contain surrogates.
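To make the bit arithmetic concrete, here is a small sketch of how a code point above U+FFFF is split into a surrogate pair and merged back. The helper names to_surrogate_pair and from_surrogate_pair are made up for illustration; they are not part of Python or pip.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a UTF-16 (high, low) pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    offset = code_point - 0x10000        # 20 bits are left after the subtraction
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


def from_surrogate_pair(high, low):
    """Merge a (high, low) surrogate pair back into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))                   # 0xd83d 0xde00
print(hex(from_surrogate_pair(high, low)))   # 0x1f600
```

Note that low surrogates live in U+DC00..U+DFFF, which is the same range that the '\udcfd' in the error message comes from.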

So Python decided to map undecodable bytes to surrogates. This is the surrogateescape error handler, which you can also use yourself if you have files with mixed encodings. A byte that cannot be decoded becomes U+DCxx, where xx is the value of the byte (xx is always >= 0x80, because anything lower is plain ASCII and decodes normally).
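A minimal sketch of that mechanism, assuming a path whose second byte is 0xFD (the byte value is taken from the '\udcfd' in the question; the rest of the name is invented for illustration):

```python
# Hypothetical raw path bytes; 0xFD can never appear in valid UTF-8.
raw = b"/\xfdpackage.whl"

# Decoding with surrogateescape keeps the bad byte as the lone surrogate U+DCFD.
text = raw.decode("utf-8", errors="surrogateescape")
print(repr(text[1]))   # '\udcfd'

# A strict encode (what urllib's quote() does in the traceback) refuses it.
try:
    text.encode("utf-8")
except UnicodeEncodeError as exc:
    print(exc)

# Encoding with the same handler gives back the original bytes, so nothing is lost.
restored = text.encode("utf-8", errors="surrogateescape")
print(restored == raw)   # True
```

On POSIX systems, os.fsdecode and os.fsencode apply the same handler with the filesystem encoding, which is how such a name ends up in a str in the first place.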

This way we can use Unicode (which works in 99.99% of cases) and still handle the special cases without losing data and without many surprises.

You hit the special case. There are two ways out: either pip learns about surrogates and explicitly encodes with surrogateescape (recovering the byte sequence of the original filename), which may cause other problems because the configuration file is then no longer valid Unicode text, or it just throws an error and lets the user correct the problem.
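If you want to correct the problem yourself, the first step is finding the offending name. The following is only a sketch: has_escaped_bytes and find_undecodable_names are hypothetical helpers, and it assumes the entries reach Python through os.listdir on a POSIX system, where undecodable bytes in names are returned as lone surrogates.

```python
import os


def has_escaped_bytes(name):
    """Return True if name contains lone surrogates and so cannot be encoded as strict UTF-8."""
    try:
        name.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


def find_undecodable_names(directory):
    """List (escaped name, raw bytes) for entries in directory with undecodable names."""
    return [
        (name, os.fsencode(name))   # fsencode turns the surrogates back into the raw bytes
        for name in os.listdir(directory)
        if has_escaped_bytes(name)
    ]


# Print with repr() so the surrogates are shown escaped instead of raising on output.
for name, raw in find_undecodable_names("."):
    print(repr(name), raw)
```

Pass the directory you gave to --find-links instead of "." to check it. Renaming or removing the reported entries, or pointing --find-links at a directory that contains only your packages, avoids the error.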

Note: using pip with --find-links / is a very bad idea. Really, never have a tool go through the files in /. Things can go wrong (not just in your case): you may run into loops, touch devices, trigger dynamic mounts, etc. Use --find-links outside your own domain (projects, files) only if you know what you are looking for, and where (e.g. a "team user" folder, or some common folders). Also consider the security risk of using Python packages outside your control if you have a multi-user system. (If you are running a web server, consider it a multi-user system.) And take the first line of your output seriously: it is just a warning, but running pip as root really is a bad thing.
