3

I am trying to extract files out of a nested zip archive and process them in memory.

What this question is not about:

  1. How to read a zip file in Java: NO, the question is how to read a zip file within a zip file within a zip and so on and so forth (as in nested zip files).

  2. Write temporary results on disk: NO, I'm asking about doing it all in memory. I found many answers using the not-so-efficient technique of writing results temporarily to disk, but that's not what I want to do.

Example:

Zipfile -> Zipfile1 -> Zipfile2 -> Zipfile3

Goal: extract the data found in each of the nested zip files, all in memory and using Java.

ZipFile is the answer, you say? No, it is not. It works for the first iteration, that is, for:

Zipfile -> Zipfile1

But once you get to Zipfile2, and perform a:

ZipInputStream z = new ZipInputStream(zipFile.getInputStream( zipEntry) ) ;

you will get a NullPointerException.

My code:

public class ZipHandler {

    String findings = new String();
    ZipFile zipFile = null;

    public void init(String fileName) throws AppException{

        try {
            //read file into stream
            zipFile = new ZipFile(fileName);
            Enumeration<?> enu = zipFile.entries();
            exctractInfoFromZip(enu);

            zipFile.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

//The idea was recursively extract entries using ZipFile
public void exctractInfoFromZip(Enumeration<?> enu) throws IOException, AppException{   

    try {
        while (enu.hasMoreElements()) { 
            ZipEntry zipEntry = (ZipEntry) enu.nextElement();

            String name = zipEntry.getName();
            long size = zipEntry.getSize();
            long compressedSize = zipEntry.getCompressedSize();

            System.out.printf("name: %-20s | size: %6d | compressed size: %6d\n", 
                    name, size, compressedSize);

            // directory ?
            if (zipEntry.isDirectory()) {
                System.out.println("dir found:" + name);
                findings+=", " + name; 
                continue;
            } 

            if (name.toUpperCase().endsWith(".ZIP") ||  name.toUpperCase().endsWith(".GZ")) {
                String fileType = name.substring(
                        name.lastIndexOf(".")+1, name.length());

                System.out.println("File type:" + fileType);
                System.out.println("zipEntry: " + zipEntry);

                if (fileType.equalsIgnoreCase("ZIP")) {
//ZipFile here returns a NULL pointer when you try to get the first nested zip
                    ZipInputStream z = new ZipInputStream(zipFile.getInputStream(zipEntry) ) ;
                    System.out.println("Opening ZIP as stream: " + name);

                    findings+=", " + name;

                    exctractInfoFromZip(zipInputStreamToEnum(z));
                } else if (fileType.equalsIgnoreCase("GZ")) {
//ZipFile here returns a NULL pointer when you try to get the first nested zip      
                    GZIPInputStream z = new GZIPInputStream(zipFile.getInputStream(zipEntry) ) ;
                    System.out.println("Opening ZIP as stream: " + name);

                    findings+=", " + name;

                    exctractInfoFromZip(gZipInputStreamToEnum(z));
                } else
                    throw new AppException("extension not recognized!");
            } else {
                System.out.println(name);
                findings+=", " + name;
            }
        }
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

    System.out.println("Findings " + findings);
} 

public Enumeration<?> zipInputStreamToEnum(ZipInputStream zStream) throws IOException{

    List<ZipEntry> list = new ArrayList<ZipEntry>();    

    while (zStream.available() != 0) {
        list.add(zStream.getNextEntry());
    }

    return Collections.enumeration(list);
} 
Ilmari Karonen
R TheWriter
  • _"(will edit soon)"_ - Do not post partial questions. Wait until you have the complete question formulated before posting. – Jim Garrison Nov 09 '17 at 18:19
  • Your main problem is that you first have to seek to the correct zip entry using `getNextEntry()` on the ZipInputStream. – JMax Nov 09 '17 at 19:23
  • Sorry for posting wrongly. I will correct the post when I get back from work. I am really stuck on this. – R TheWriter Nov 09 '17 at 22:27
  • In the future, please consider turning your code into a [mcve]; that is, preferably, a single .java file with a main() method that we can compile and run to reproduce the problem. – Ilmari Karonen Nov 10 '17 at 15:06
  • Anyway, your real problem seems to be that your `zipFile` variable always refers to the outermost file. It shouldn't really be a surprise that trying to pass a ZipEntry from one of the inner ZipInputStreams into `zipFile.getInputStream()` will fail. (The only surprise is that it doesn't throw an IllegalArgumentException.) Since there doesn't seem to be any way to instantiate a ZipFile from an InputStream, it seems your best option is to abandon ZipFile and just work directly with ZipInputStreams, like JMax suggests below. – Ilmari Karonen Nov 10 '17 at 15:12
  • Thanks for your comments, will try to make the code better next time – R TheWriter Nov 10 '17 at 17:49

3 Answers

6

I have not tried it, but using `ZipInputStream` you can read any InputStream that contains a ZIP file as data. Iterate through the entries, and when you find the correct entry, use the `ZipInputStream` to create another, nested `ZipInputStream`.

The following code demonstrates this. Imagine we have a readme.txt inside 0.zip which is again zipped in 1.zip which is zipped in 2.zip. Now we read some text from readme.txt:

try (FileInputStream fin = new FileInputStream("D:/2.zip")) {
    ZipInputStream firstZip = new ZipInputStream(fin);
    ZipInputStream zippedZip = new ZipInputStream(findEntry(firstZip, "1.zip"));
    ZipInputStream zippedZippedZip = new ZipInputStream(findEntry(zippedZip, "0.zip"));

    ZipInputStream zippedZippedZippedReadme = findEntry(zippedZippedZip, "readme.txt");
    InputStreamReader reader = new InputStreamReader(zippedZippedZippedReadme);
    char[] cbuf = new char[1024];
    int read = reader.read(cbuf);
    System.out.println(new String(cbuf, 0, read));
    .....

public static ZipInputStream findEntry(ZipInputStream in, String name) throws IOException {
    ZipEntry entry = null;
    while ((entry = in.getNextEntry()) != null) {
        if (entry.getName().equals(name)) {
            return in;
        }
    }
    return null;
}

Note that the code is really ugly: it does not close anything, nor does it check for errors. It is just a minimized version that demonstrates how it works.

Theoretically, there is no limit on how many ZipInputStreams you can cascade into one another. The data is never written to a temporary file. Decompression is performed only on demand, as you read from each InputStream.
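A minimal sketch of that cascade for the case where the entry names are not known in advance, assuming nested archives can be recognized by a `.zip` suffix. The class name `NestedZipWalker`, its methods and the `!/` path separator are hypothetical choices for illustration, not part of the JDK; `zipOf` exists only to build sample data in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class; the names and the "!/" separator are illustrative.
class NestedZipWalker {

    /** Collects the path of every entry, descending into entries named *.zip. */
    static void walk(ZipInputStream zin, String prefix, List<String> found) throws IOException {
        ZipEntry entry;
        while ((entry = zin.getNextEntry()) != null) {
            String path = prefix + entry.getName();
            found.add(path);
            boolean nestedZip = !entry.isDirectory()
                    && entry.getName().toLowerCase(Locale.ROOT).endsWith(".zip");
            if (nestedZip) {
                // The outer stream is now positioned on this entry's data, so it can
                // be wrapped directly. The inner stream is deliberately not closed:
                // closing it would close the outer stream as well.
                walk(new ZipInputStream(zin), path + "!/", found);
            }
        }
    }

    /** Lists all entries of a zip archive held in memory. */
    static List<String> list(byte[] zipBytes) throws IOException {
        List<String> found = new ArrayList<>();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            walk(zin, "", found);
        }
        return found;
    }

    /** Builds a zip archive in memory containing a single entry (sample data only). */
    static byte[] zipOf(String entryName, byte[] content) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bytes)) {
            zout.putNextEntry(new ZipEntry(entryName));
            zout.write(content);
            zout.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // readme.txt inside 0.zip inside 1.zip inside 2.zip, as in the example above
        byte[] zip0 = zipOf("readme.txt", "hello".getBytes(StandardCharsets.UTF_8));
        byte[] zip1 = zipOf("0.zip", zip0);
        byte[] zip2 = zipOf("1.zip", zip1);
        list(zip2).forEach(System.out::println);
    }
}
```

Only the outermost stream is closed here; the nested streams share its underlying data, so one close at the top is enough. Replacing `found.add(path)` with whatever per-entry processing is needed keeps everything streaming, with no entry buffered in full.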

JMax
  • JMax, I don't know the names of the files within the original input.zip file. The input zip file contains files, a directory structure and nested zip files, e.g.:
    input.zip
    Root Folder
    --file.txt
    --inputNested.zip
    --Nested Folder
    ----file2.wht
    ----inputNested2.zip
    and so on and so forth. – R TheWriter Nov 10 '17 at 12:47
  • Well then enumerate through the entries, check the entry name/filename and do whatever you want with it. – JMax Nov 11 '17 at 13:08
  • Hi i know this answer is quite old @j – Thebestshoot May 16 '18 at 13:15
2

This is the way I found to unzip files in memory.

The code is not clean at all, but I understand the rules are to post something working, so I am posting this in the hope that it helps.

What I do is use a recursive method to navigate the complex ZIP file, extract the folders and inner zip files, and save the results in memory to work with them later.

The main things I found and want to share with you:

  1. ZipFile is useless if you have nested zip files.

  2. You have to use the basic ZipInputStream and OutputStream classes.

  3. I only use recursive programming to unzip nested zips.

package course.hernan;

import java.io.BufferedInputStream;

import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

import org.apache.commons.io.IOUtils;

public class FileReader {

    private static final int BUFFER_SIZE = 2048;

    public static void main(String[] args) {
        try {
            File f = new File("DIR/inputs.zip");
            FileInputStream fis = new FileInputStream(f);
            BufferedInputStream bis = new BufferedInputStream(fis);
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            BufferedOutputStream bos = new BufferedOutputStream(baos);
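            byte[] buffer = new byte[BUFFER_SIZE];
            int count;
            // write only the bytes actually read; a read may not fill the whole buffer
            while ((count = bis.read(buffer, 0, BUFFER_SIZE)) != -1) {
               bos.write(buffer, 0, count);
            }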

            bos.flush();
            bos.close();
            bis.close();

            //This STACK has the output byte array information 
            Deque<Map<Integer, Object[]>> outputDataStack = ZipHandler1.unzip(baos);


        } catch (FileNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}    
package course.hernan;

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

import org.apache.commons.lang3.StringUtils;

public class ZipHandler1 {

  private static final int BUFFER_SIZE = 2048;

  private static final String ZIP_EXTENSION = ".zip";
  public static final Integer FOLDER = 1;
  public static final Integer ZIP = 2;
  public static final Integer FILE = 3;


  public static Deque<Map<Integer, Object[]>> unzip(ByteArrayOutputStream zippedOutputFile) {

    try {

      ZipInputStream inputStream = new ZipInputStream(
          new BufferedInputStream(new ByteArrayInputStream(
              zippedOutputFile.toByteArray())));

      ZipEntry entry;

      Deque<Map<Integer, Object[]>> result = new ArrayDeque<Map<Integer, Object[]>>();

      while ((entry = inputStream.getNextEntry()) != null) {

        LinkedHashMap<Integer, Object[]> map = new LinkedHashMap<Integer, Object[]>();
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        System.out.println("\tExtracting entry: " + entry);
        int count;
        byte[] data = new byte[BUFFER_SIZE];

        if (!entry.isDirectory()) {
          BufferedOutputStream out = new BufferedOutputStream(
              outputStream, BUFFER_SIZE);

          while ((count = inputStream.read(data, 0, BUFFER_SIZE)) != -1) {
            out.write(data, 0, count);
          }

          out.flush();
          out.close();

          //  recursively unzip files
          if (entry.getName().toUpperCase().endsWith(ZIP_EXTENSION.toUpperCase())) {
            map.put(ZIP, new Object[] {entry.getName(), unzip(outputStream)});
            result.add(map);
            //result.addAll();
          } else { 
            map.put(FILE, new Object[] {entry.getName(), outputStream});
            result.add(map);
          }
        } else {
          map.put(FOLDER, new Object[] {entry.getName(), unzip(outputStream)});
          result.add(map);
        }
      }

      inputStream.close();

      return result;

    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}
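The nested `Deque<Map<Integer, Object[]>>` result above can be awkward to consume. As a rough sketch of the same buffer-and-recurse idea, the variant below returns a flat map from a path to the file's bytes. The class name `FlatUnzipper`, its methods and the `!/` separator between archive levels are hypothetical, and the whole archive is assumed to fit in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical alternative to ZipHandler1; names and the "!/" separator are illustrative.
class FlatUnzipper {

    /** Returns every regular file, keyed by its path through the nested archives. */
    static Map<String, byte[]> unzip(byte[] zipBytes) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        collect(zipBytes, "", files);
        return files;
    }

    private static void collect(byte[] zipBytes, String prefix, Map<String, byte[]> files)
            throws IOException {
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            byte[] buffer = new byte[4096];
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // directory entries carry no data
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int count;
                while ((count = zin.read(buffer)) != -1) {
                    out.write(buffer, 0, count); // only the bytes actually read
                }
                String path = prefix + entry.getName();
                if (entry.getName().toLowerCase(Locale.ROOT).endsWith(".zip")) {
                    // recurse into the buffered inner archive
                    collect(out.toByteArray(), path + "!/", files);
                } else {
                    files.put(path, out.toByteArray());
                }
            }
        }
    }

    /** Builds a zip archive in memory containing a single entry (sample data only). */
    static byte[] zipOf(String entryName, byte[] content) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bytes)) {
            zout.putNextEntry(new ZipEntry(entryName));
            zout.write(content);
            zout.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] inner = zipOf("file2.txt", "abc".getBytes(StandardCharsets.UTF_8));
        byte[] outer = zipOf("inputNested.zip", inner);
        unzip(outer).forEach((path, data) ->
                System.out.println(path + " (" + data.length + " bytes)"));
    }
}
```

Because each nested archive is buffered separately, every stream can be closed with try-with-resources, at the cost of holding each inner archive in memory while it is being read.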
R TheWriter
1

Thanks to JMax. In my case, the result of reading the PDF file differed from the expected result: the file became bigger and could not be opened. I finally found my mistake: a read may not fill the whole buffer, yet I wrote the entire buffer every time. The following is the erroneous code:

   while((n = zippedZippedZippedReadme.read(buffer)) != -1) {
                fos.write(buffer);
            }

Here is the corrected code:

try (FileInputStream fin = new FileInputStream("1.zip")) {
    ZipInputStream firstZip = new ZipInputStream(fin);
    ZipInputStream zippedZip = new ZipInputStream(findEntry(firstZip, "0.zip"));
    ZipInputStream zippedZippedZippedReadme = findEntry(zippedZip, "test.pdf");
    long startTime = System.currentTimeMillis();
    byte[] buffer = new byte[4096];
    File outputFile = new File("test.pdf");
    try (FileOutputStream fos = new FileOutputStream(outputFile)) {
        int n;
        while ((n = zippedZippedZippedReadme.read(buffer)) != -1) {
            // write only the n bytes that were actually read
            fos.write(buffer, 0, n);
        }
        fos.flush();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }

    System.out.println("time consuming:" + (System.currentTimeMillis() - startTime) / 1000.0);
}
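On Java 9 or later, the manual read loop can also be delegated to `InputStream.transferTo`, which avoids the partial-buffer mistake altogether. The sketch below is only an illustration under that assumption: the class `EntryCopy` and its methods `seek`, `copyCurrentEntry` and `zipOf` are hypothetical names, and the sample archive is built in memory rather than read from `1.zip`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper; class and method names are illustrative.
class EntryCopy {

    /** Positions zin on the named entry; returns false if no such entry exists. */
    static boolean seek(ZipInputStream zin, String name) throws IOException {
        ZipEntry entry;
        while ((entry = zin.getNextEntry()) != null) {
            if (entry.getName().equals(name)) {
                return true;
            }
        }
        return false;
    }

    /** Copies the current entry to out and returns the number of bytes copied. */
    static long copyCurrentEntry(ZipInputStream zin, OutputStream out) throws IOException {
        // transferTo (Java 9+) keeps reading until read() reports the end of the entry
        return zin.transferTo(out);
    }

    /** Builds a zip archive in memory containing a single entry (sample data only). */
    static byte[] zipOf(String entryName, byte[] content) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bytes)) {
            zout.putNextEntry(new ZipEntry(entryName));
            zout.write(content);
            zout.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] pdf = new byte[10_000]; // stand-in for the PDF; not a multiple of 4096
        byte[] zip = zipOf("0.zip", zipOf("test.pdf", pdf));
        try (ZipInputStream outer = new ZipInputStream(new ByteArrayInputStream(zip))) {
            if (!seek(outer, "0.zip")) {
                throw new IOException("0.zip not found");
            }
            ZipInputStream inner = new ZipInputStream(outer);
            if (!seek(inner, "test.pdf")) {
                throw new IOException("test.pdf not found");
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            System.out.println(copyCurrentEntry(inner, out) + " bytes copied");
        }
    }
}
```

Passing a `FileOutputStream` instead of the `ByteArrayOutputStream` writes the entry to disk, as in the code above. Unlike `findEntry`, `seek` reports a missing entry with a boolean, so the caller never wraps a null stream.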

Hope this helps!

yilin