On Windows, when a file path contains Chinese characters, those characters are garbled inside the tar.gz, and they are still garbled after decompression.
-2
-
python3, you can see it in the title – Timothy Kwok Nov 11 '13 at 09:32
-
So, did you try reading the `tarfile` docs and not understand the explanation there? Or did you just not bother trying? – abarnert Nov 11 '13 at 09:44
1 Answer
0
This is all explained in Unicode issues in the docs.
For all tarball formats before PAX, including the default format used by `tarfile`, filenames are stored in a "local filesystem encoding". The compressing program has to take a wild guess at what the decompressing program will want, and vice-versa. If you don't take a guess in your program, Python will do it for you, and guess UTF-8. See `TarFile`, which explains that it uses `ENCODING` if you don't specify anything, and `ENCODING`, which explains that it defaults to `'utf-8'` on Windows.
So, there are three solutions:
- Use PAX-format tarballs. This is easy; just pass `format=tarfile.PAX_FORMAT` to the `TarFile` constructor. (You can also set `tarfile.DEFAULT_FORMAT = tarfile.PAX_FORMAT` to change the default.) As long as the tool you're using to decompress understands PAX, you're set.
- Figure out which encoding your decompression tool wants, and specify that explicitly by passing, e.g., `encoding='big5'` to the `TarFile` constructor. (You can also set `tarfile.ENCODING = 'big5'` to change the default.) There's a good chance your tool uses your system's configured OEM charset, but no guarantee of that, and without knowing what tool you're using, I can't give any more details on how to figure it out.
- Let Python use UTF-8, and convince your decompression tool to read UTF-8 instead of making a wild guess. Without knowing which tool you're using, I can't give any more detail than that.
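The first two options can be sketched roughly as follows. This is only an illustration, not a drop-in fix: the helpers `make_archive` and `list_names` and the sample member name are made up for the example, the archive is built in memory rather than on disk, and `big5` stands in for whatever encoding your decompression tool actually expects. Note that the encoding is passed with the `encoding` keyword, which `tarfile.open` forwards to the `TarFile` constructor.

```python
import io
import tarfile


def make_archive(member_name, data, **tar_kwargs):
    """Build an in-memory .tar.gz with one member and return its bytes.

    Hypothetical helper: extra keyword arguments (format, encoding)
    are forwarded to the TarFile constructor by tarfile.open().
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz", **tar_kwargs) as tar:
        info = tarfile.TarInfo(name=member_name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_names(archive_bytes, **tar_kwargs):
    """Return the member names stored in an in-memory .tar.gz."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz",
                      **tar_kwargs) as tar:
        return tar.getnames()


name = "測試/文件.txt"  # sample member name, chosen only for illustration

# Option 1: PAX format. Non-ASCII names go into a PAX extended header
# as UTF-8, so the reader does not have to guess an encoding.
pax_archive = make_archive(name, b"hello", format=tarfile.PAX_FORMAT)

# Option 2: a pre-PAX format with an explicit encoding. The reader must
# use the same encoding (assumed here to be big5) to get the name back.
big5_archive = make_archive(name, b"hello",
                            format=tarfile.GNU_FORMAT, encoding="big5")

print(list_names(pax_archive) == [name])
print(list_names(big5_archive, encoding="big5") == [name])
# Reading the big5 archive as UTF-8 yields a garbled name instead.
print(list_names(big5_archive, encoding="utf-8") == [name])
```

The last line shows the failure mode from the question: the bytes in the archive are fine, but the reader decoded them with a different encoding than the writer used.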

abarnert