tarfile produces garbled file names in .tar.gz archives

Question

On Windows, when a file path contains Chinese characters, those characters are garbled in the .tar.gz archive, and they are still garbled after decompression.

So, did you try reading the `tarfile` docs and not understand the explanation there? Or did you just not bother trying? — abarnert, Nov 11 '13 at 09:44

score 0 · Answer 1 · answered Nov 11 '13 at 09:40

This is all explained in the "Unicode issues" section of the tarfile docs.

For all tarball formats before PAX (including GNU_FORMAT, which was tarfile's default before Python 3.8), filenames are stored in a "local filesystem encoding". The compressing program has to take a wild guess at what the decompressing program will want, and vice versa. If you don't take a guess in your program, Python will do it for you. See TarFile, which explains that it uses ENCODING if you don't specify anything, and ENCODING, which explains that it defaults to 'utf-8' on Windows and to sys.getfilesystemencoding() everywhere else.
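To make the mismatch concrete, here is a minimal sketch of what happens when the writer and the reader guess different encodings. The file name and the choice of GBK as the reader's encoding are assumptions for illustration only; your decompression tool may use something else.

```python
import tarfile

# Hypothetical file name containing Chinese characters.
name = "中文.txt"

# The same name becomes different bytes under different encodings.
utf8_bytes = name.encode("utf-8")
gbk_bytes = name.encode("gbk")

# A reader that assumes GBK (an assumption here) misreads UTF-8 bytes.
garbled = utf8_bytes.decode("gbk", errors="replace")

print("tarfile.ENCODING on this machine:", tarfile.ENCODING)
print("bytes differ:", utf8_bytes != gbk_bytes)
print("name survived the round trip:", garbled == name)
```

The bytes written into the header are only meaningful if the reader decodes them with the same encoding the writer used.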

So, there are three solutions:

1. Use PAX-format tarballs. This is easy; just pass format=tarfile.PAX_FORMAT to the TarFile constructor. (You can also set tarfile.DEFAULT_FORMAT = tarfile.PAX_FORMAT to change the default.) As long as the tool you're using to decompress understands PAX, you're set.
2. Figure out which encoding your decompression tool wants, and specify that explicitly by passing, e.g., encoding='big5' to the TarFile constructor. (You can also set tarfile.ENCODING = 'big5' to change the default.) There's a good chance your tool uses your system's configured OEM charset, but no guarantee of that, and without knowing what tool you're using, I can't give any more details on how to figure it out.
3. Let Python use UTF-8, and convince your decompression tool to read UTF-8 instead of making a wild guess. Without knowing which tool you're using, I can't give any more detail than that.
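The first two options can be sketched roughly as follows. The helper names make_archive and list_names, the member name, and the use of 'gbk' as the target encoding are all assumptions for illustration; substitute whatever encoding your decompression tool really expects. The archives are built in memory so the example does not touch the filesystem.

```python
import io
import tarfile


def make_archive(name, payload, **open_kwargs):
    """Build an in-memory .tar.gz holding one member; return its bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz", **open_kwargs) as tar:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def list_names(data, **open_kwargs):
    """Return the member names stored in an in-memory .tar.gz."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz", **open_kwargs) as tar:
        return tar.getnames()


# Hypothetical member name containing Chinese characters.
name = "中文/文件.txt"

# Option 1: PAX format, which records non-ASCII names as UTF-8.
pax_data = make_archive(name, b"hello", format=tarfile.PAX_FORMAT)

# Option 2: an older format with an explicit encoding ('gbk' is an assumption).
gbk_data = make_archive(name, b"hello", format=tarfile.GNU_FORMAT, encoding="gbk")

pax_names = list_names(pax_data)
gbk_names = list_names(gbk_data, encoding="gbk")
print(pax_names, gbk_names)
```

For a real archive on disk, the same keyword arguments can be passed to tarfile.open along with a file name instead of fileobj.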
