I'm using the Apache Commons 1.4.1 library to extract ".tar" files.

Problem: I don't need to extract every file. I only need specific files from specific locations inside the tar archive: a few .xml files, whereas the TAR file itself is around 300 MB, so extracting the entire content is a waste of resources.

I'm stuck and unsure whether I have to do a nested directory comparison or whether there is another way around this.

Note: the location of the required .xml files is always the same.

The structure of the TAR is:

directory:E:\Root\data
     file:E:\Root\datasheet.txt
directory:E:\Root\map
     file:E:\Root\mapers.txt
directory:E:\Root\ui
     file:E:\Root\ui\capital.txt
     file:E:\Root\ui\info.txt
directory:E:\Root\ui\sales
     file:E:\Root\ui\sales\Reqest_01.xml
     file:E:\Root\ui\sales\Reqest_02.xml
     file:E:\Root\ui\sales\Reqest_03.xml
     file:E:\Root\ui\sales\Reqest_04.xml
directory:E:\Root\ui\sales\stores
directory:E:\Root\ui\stores
directory:E:\Root\urls
directory:E:\Root\urls\fullfilment
     file:E:\Root\urls\fullfilment\Cams_01.xml
     file:E:\Root\urls\fullfilment\Cams_02.xml
     file:E:\Root\urls\fullfilment\Cams_03.xml
     file:E:\Root\urls\fullfilment\Cams_04.xml
directory:E:\Root\urls\fullfilment\profile
directory:E:\Root\urls\fullfilment\registration
     file:E:\Root\urls\options.txt
directory:E:\Root\urls\profile

Constraint: I can't use JDK 7 and have to stick with the Apache Commons library.

My current solution:

public static void untar(File[] files) throws Exception {
        String path = files[0].toString();
        File tarPath = new File(path);
        // Directory that contains the archive; entries are extracted next to it.
        String pathWithoutName = path.substring(0, path.lastIndexOf(tarPath.getName()));
        TarEntry entry;
        TarInputStream inputStream = null;
        try {
            inputStream = new TarInputStream(new FileInputStream(tarPath));
            byte[] buffer = new byte[1024];
            while (null != (entry = inputStream.getNextEntry())) {
                int bytesRead;
                System.out.println("tarpath:" + tarPath.getName());
                System.out.println("Entry:" + entry.getName());
                System.out.println("pathname:" + pathWithoutName);
                if (entry.isDirectory()) {
                    File directory = new File(pathWithoutName + entry.getName());
                    directory.mkdirs(); // also creates missing parent directories
                    continue;
                }
                FileOutputStream outputStream = new FileOutputStream(pathWithoutName + entry.getName());
                try {
                    while ((bytesRead = inputStream.read(buffer, 0, buffer.length)) > -1) {
                        outputStream.write(buffer, 0, bytesRead);
                    }
                } finally {
                    outputStream.close(); // release the file handle after each entry
                }
                System.out.println("Extracted " + entry.getName());
            }
        } finally {
            // Close the archive stream even if extraction fails part-way.
            if (inputStream != null) {
                inputStream.close();
            }
        }
}
Wills

1 Answer

The TAR file format is designed to be written or read as a stream (i.e., to or from a tape drive), and does not have a centralized header. So no, there's no way around reading through the entire file to extract individual entries.
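
To make the streaming layout concrete, here is a minimal sketch (not the library's implementation) that walks an archive header by header, copies only the entries whose names match, and skips the data of everything else. It assumes the classic 512-byte header layout (name at offset 0, octal size at offset 124, type flag at offset 156) and ignores GNU long names and pax extensions. The class name `TarScanSketch`, the `entry` helper, and the `Root/ui/sales/` prefix are hypothetical, and the demo helper fills in only the header fields this reader looks at. It avoids Java 7 features. With the tar library from the question, the equivalent should be to test `entry.getName()` inside the `getNextEntry()` loop and write only the matching entries.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;

public class TarScanSketch {
    private static final int BLOCK = 512;

    /** Reads the archive once; returns regular-file entries whose name matches prefix and suffix. */
    public static Map<String, byte[]> extractMatching(InputStream in, String prefix, String suffix)
            throws IOException {
        Map<String, byte[]> result = new LinkedHashMap<String, byte[]>();
        DataInputStream data = new DataInputStream(in);
        byte[] header = new byte[BLOCK];
        while (true) {
            try {
                data.readFully(header);
            } catch (EOFException e) {
                break; // stream ended without an end-of-archive marker
            }
            if (isAllZero(header)) {
                break; // a zero-filled block marks the end of the archive
            }
            String name = field(header, 0, 100);
            long size = Long.parseLong(field(header, 124, 12).trim(), 8);
            long padded = (size + BLOCK - 1) / BLOCK * BLOCK; // data is padded to whole blocks
            boolean regular = header[156] == '0' || header[156] == 0;
            if (regular && name.startsWith(prefix) && name.endsWith(suffix)) {
                byte[] body = new byte[(int) size];
                data.readFully(body);
                result.put(name, body);
                skipFully(data, padded - size);
            } else {
                skipFully(data, padded); // unwanted entry: skip its data without copying it
            }
        }
        return result;
    }

    private static boolean isAllZero(byte[] block) {
        for (int i = 0; i < block.length; i++) {
            if (block[i] != 0) return false;
        }
        return true;
    }

    /** Decodes a NUL-terminated ASCII header field. */
    private static String field(byte[] h, int off, int len) throws IOException {
        int end = off;
        while (end < off + len && h[end] != 0) end++;
        return new String(h, off, end - off, "US-ASCII");
    }

    private static void skipFully(InputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped <= 0) { // skip() may make no progress; fall back to reading one byte
                if (in.read() < 0) throw new EOFException("archive truncated");
                skipped = 1;
            }
            n -= skipped;
        }
    }

    /** Demo helper (hypothetical): builds one entry, filling only the fields read above. */
    public static byte[] entry(String name, byte[] body) throws IOException {
        byte[] h = new byte[BLOCK];
        byte[] nameBytes = name.getBytes("US-ASCII");
        System.arraycopy(nameBytes, 0, h, 0, nameBytes.length);
        byte[] sizeBytes = String.format("%011o", body.length).getBytes("US-ASCII");
        System.arraycopy(sizeBytes, 0, h, 124, sizeBytes.length);
        h[156] = '0'; // regular file
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(h);
        out.write(body);
        out.write(new byte[(BLOCK - body.length % BLOCK) % BLOCK]); // pad to a block boundary
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream tar = new ByteArrayOutputStream();
        tar.write(entry("Root/ui/info.txt", "skip me".getBytes("US-ASCII")));
        tar.write(entry("Root/ui/sales/Reqest_01.xml", "<a/>".getBytes("US-ASCII")));
        tar.write(new byte[2 * BLOCK]); // end-of-archive marker
        Map<String, byte[]> found = extractMatching(
                new ByteArrayInputStream(tar.toByteArray()), "Root/ui/sales/", ".xml");
        System.out.println(found.keySet());
    }
}
```

Every header still has to be read, but the bulk of a 300 MB archive is only skipped over rather than written to disk, which is usually where most of the cost goes.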

If you want random access, you should use the ZIP format, and open using the JDK's ZipFile. Assuming that you have enough virtual memory, the file will be memory-mapped, making random access very fast (I haven't looked to see if it will use a random-access file if unable to memory-map).
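
For comparison, here is a minimal sketch of looking up a single entry by name with `java.util.zip.ZipFile`, which can jump straight to an entry through the archive's central directory. The class name `ZipLookupSketch` and the method `readEntry` are hypothetical, and the code sticks to APIs available before JDK 7.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ZipLookupSketch {
    /** Returns the bytes of the named entry, or null if the archive has no such entry. */
    public static byte[] readEntry(File zip, String entryName) throws IOException {
        ZipFile zipFile = new ZipFile(zip);
        try {
            ZipEntry entry = zipFile.getEntry(entryName); // lookup by name, no scan of entry data
            if (entry == null) {
                return null;
            }
            InputStream in = zipFile.getInputStream(entry);
            try {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[8192];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                return out.toByteArray();
            } finally {
                in.close();
            }
        } finally {
            zipFile.close();
        }
    }
}
```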

parsifal
  • I have no option of using a zip file here. – Wills Jan 22 '13 at 18:07
  • @Wills - in that case, you're stuck reading through the entire file. You might try adding a `BufferedInputStream` around the `FileInputStream` for increased performance (although I suspect `TarInputStream` buffers internally). And since you're already using Jakarta Commons, I recommend replacing your copy loop with `IOUtils.copy()`. – parsifal Jan 22 '13 at 18:10
  • “I haven't looked to see if it will use a random-access file if unable to memory-map”. Assuming you’re talking about the reference implementation, it *only* uses a `RandomAccessFile`. There is no memory mapped I/O in `ZipFile` at all. – Holger Jun 22 '21 at 14:16