Encoding errors when compressing files with Apache Commons Compression on Linux

Question

I am compressing files using the Apache Commons Compress API. It works fine on Windows 7, but on Linux (Ubuntu 10.10, UTF-8), characters such as "º" in file and folder names are replaced by "?".

Is there a parameter I should pass to the API when compressing, or when extracting the tar?

I am using the tar.gz format, following the API examples.

The files I'm trying to compress were created on Windows. Could that be a problem?

The code:

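    import java.io.BufferedOutputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
    import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;
    import org.apache.commons.compress.compressors.gzip.GzipCompressorOutputStream;
    import org.apache.commons.compress.utils.IOUtils; // assumed; the original imports were not shown

    public class TarGzTest
    {
        public static void createTarGzOfDirectory(String directoryPath, String tarGzPath) throws IOException
        {
            System.out.println("Creating tar.gz of folder " + directoryPath + " at " + tarGzPath);

            // try-with-resources closes the streams in reverse order, even on failure,
            // and avoids a NullPointerException when one of them could not be opened.
            try (FileOutputStream fOut = new FileOutputStream(new File(tarGzPath));
                 BufferedOutputStream bOut = new BufferedOutputStream(fOut);
                 GzipCompressorOutputStream gzOut = new GzipCompressorOutputStream(bOut);
                 TarArchiveOutputStream tOut = new TarArchiveOutputStream(gzOut))
            {
                // GNU long-name mode; set once, before any entry is written.
                tOut.setLongFileMode(TarArchiveOutputStream.LONGFILE_GNU);

                addFileToTarGz(tOut, directoryPath, "");
                tOut.finish();
            }
            System.out.println("Process finished.");
        }

        private static void addFileToTarGz(TarArchiveOutputStream tOut, String path, String base) throws IOException
        {
            System.out.println("addFileToTarGz()::" + path);
            File f = new File(path);
            String entryName = base + f.getName();

            if (f.isFile())
            {
                TarArchiveEntry tarEntry = new TarArchiveEntry(f, entryName);
                tOut.putArchiveEntry(tarEntry);

                // Close the input stream after copying so file handles are not leaked.
                try (InputStream in = new FileInputStream(f))
                {
                    IOUtils.copy(in, tOut);
                }

                tOut.closeArchiveEntry();
            }
            else
            {
                // Directories get no entry of their own; only their files are added.
                File[] children = f.listFiles();

                if (children != null)
                {
                    for (File child : children)
                    {
                        addFileToTarGz(tOut, child.getAbsolutePath(), entryName + "/");
                    }
                }
            }
        }
    }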

(I omitted the main method.)

EDIT (monkeyjluffy): The changes I made are meant to always produce the same archive on different platforms, so that the hash calculated on it is the same.

Do you mean that when you decompress, the file isn't the same as it was? Please show the exact code you're using. — Jon Skeet, Jul 19 '11 at 18:47
Could it be related to how CR or LF are represented in Windows vs Linux? — Marsellus Wallace, Jul 19 '11 at 18:47
@jon-skeet I edited the question, added code and some info.. — caarlos0, Jul 19 '11 at 18:53
@caarlos0: Okay, so that's the compression part... and decompression? How are you viewing the "bad" files? — Jon Skeet, Jul 19 '11 at 18:54
@jon-skeet I am decompressing with "tar xzvf file.tar.gz"... — caarlos0, Jul 19 '11 at 18:56
@caarlos0: And what happens with binary files? Have you tried looking at the differences in a binary file editor? — Jon Skeet, Jul 19 '11 at 18:58
@jon-skeet it happens in the file names, folder names, etc... With binary files I am not sure... — caarlos0, Jul 19 '11 at 19:11
@caarlos0: Ah. You didn't mention that it was the *names* rather than the *data*. That's a completely different matter. I've no idea what the problem is, but I strongly suggest you edit your question to make it clearer for whoever looks next. — Jon Skeet, Jul 19 '11 at 19:13
@caarlos: sadly non-ASCII characters in filenames are a gigantic headache. We've got build scripts set to fail immediately should someone try to build/commit a file using non-ASCII characters (or spaces, etc.) in filenames. For example, say I give you files named "r̶ǫ1", "r̶ǫ2" and "r̶ǫ3" and you need to find all files starting with "r̶ǫ" (say from the command line or from Spotlight etc.), how do you do it, and how intelligent was it to use non-ASCII characters in filenames now? In a mix of OS X / Windows / Linux / webapps (they're a kind of platform too) you're in for a world of hurt. — SyntaxT3rr0r, Jul 19 '11 at 19:41
@caarlos: I don't know if it's ironic or sad or something else, but without doing it on purpose, I realize the non-ASCII characters I just put in my comment break my Chrome / Linux / StackOverflow display, with my comment above not being properly rendered (it overflows and goes into the right margin). See why non-ASCII in filenames (and in *.java* source files too, btw) is a headache? No metadata + no ASCII = gigantic headaches. Just say no to it. — SyntaxT3rr0r, Jul 19 '11 at 19:43
@SyntaxT3rr0r: I understand you, but most users don't. So, sadly, I have to find a workaround for it. =( — caarlos0, Jul 20 '11 at 10:59

score 1 · Accepted Answer · answered Jul 20 '11 at 13:52

I found a workaround for my problem.

For some reason, Java doesn't respect my environment's encoding and switches to cp1252.
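To check which encoding the JVM has actually picked up, a small diagnostic along these lines may help. It is only a sketch: the class name `EncodingCheck` and the method `encodingSettings` are made up for illustration, and `sun.jnu.encoding` is an OpenJDK/HotSpot-specific property that other JVMs may not define (it then shows up as `null`).

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

public class EncodingCheck {

    // Collects the encoding-related settings as seen by the running JVM.
    // Missing properties or environment variables are reported as "null".
    static Map<String, String> encodingSettings() {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("defaultCharset", Charset.defaultCharset().name());
        settings.put("file.encoding", String.valueOf(System.getProperty("file.encoding")));
        // Implementation-specific property (assumption: OpenJDK/HotSpot).
        settings.put("sun.jnu.encoding", String.valueOf(System.getProperty("sun.jnu.encoding")));
        settings.put("LANG", String.valueOf(System.getenv("LANG")));
        settings.put("LC_ALL", String.valueOf(System.getenv("LC_ALL")));
        return settings;
    }

    public static void main(String[] args) {
        encodingSettings().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

If the charset reported here is not UTF-8 while the shell's `LANG` says it should be, the process that launches the JVM is probably running with a different locale than the interactive shell.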

After uncompressing the archive, I just went into the extracted folder and ran this command:

    convmv --notest -f cp1252 -t utf8 * -r

And it converts everything recursively to UTF-8.
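For readers wondering what that conversion amounts to for a single name, here is a rough Java sketch of the same idea: decode the raw name bytes as cp1252, then re-encode the text as UTF-8. The class `NameRecoding` and method `recode` are hypothetical names for illustration, not part of convmv or Commons Compress, and the sketch assumes the JRE provides the `windows-1252` charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class NameRecoding {

    // Decodes raw file-name bytes with the charset they were written in,
    // then re-encodes the resulting text in the target charset.
    static byte[] recode(byte[] rawName, Charset from, Charset to) {
        return new String(rawName, from).getBytes(to);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        byte[] raw = {(byte) 0xBA}; // "º" as a single cp1252 byte
        byte[] utf8 = recode(raw, cp1252, StandardCharsets.UTF_8);
        for (byte b : utf8) {
            System.out.printf("%02X ", b); // prints "C2 BA "
        }
        System.out.println();
    }
}
```

The lone byte 0xBA is not valid UTF-8 on its own, which is why a UTF-8 terminal shows "?" for it until the name is converted to the two-byte sequence C2 BA.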

Problem solved, guys.

More info about encoding problems in Linux here.

Thanks everyone for the help.
