
I am trying to use Apache Commons Compress to read the contents of a 7-zip file. I'm not interested in reading/extracting the content; I just want the list of all the entries.

I wrote this code, but with a 4 MB archive it takes 6 seconds to read the whole file.

import java.io.File;
import java.io.IOException;

import org.apache.commons.compress.archivers.sevenz.SevenZArchiveEntry;
import org.apache.commons.compress.archivers.sevenz.SevenZFile;

public static void main(String[] args) throws IOException {
    File sevenz = new File("testfile.7z");
    System.out.println("Reading 7-zip...");
    SevenZFile sevenZFile = new SevenZFile(sevenz);
    long s = System.currentTimeMillis();
    SevenZArchiveEntry entry;
    while ((entry = sevenZFile.getNextEntry()) != null) {
        System.out.print(entry.isDirectory() ? "Dir" : "File");
        System.out.print("\t");
        System.out.print("*********.***"); // entry.getName();
        System.out.print("\t");
        System.out.println(entry.getHasCrc() ? "CRC" : "NO-CRC");
    }
    sevenZFile.close();
    System.out.println("------------------------------");
    System.out.println("7-zip\t" + (System.currentTimeMillis() - s) + " ms to read.");
}

The output is:

Reading 7-zip...
File    *********.***   CRC
File    *********.***   CRC
File    *********.***   CRC
File    *********.***   CRC
File    *********.***   CRC
------------------------------
7-zip   6236 ms to read.

Is the file listing process supposed to take this long, or am I doing something wrong? I also tried removing all the prints, but the time it takes to read the file is the same.

Vektor88

1 Answer


That does seem a little on the high side. The first thing I would do is strip out extraneous work and time only the reading portion.

That means commenting out all the System.out.print calls inside the loop:

while ((entry = sevenZFile.getNextEntry()) != null) {
}
System.out.println("total\t" + (System.currentTimeMillis()-s) + " ms.");

Do that and see if it makes a difference. It will tell you whether the time goes to the entry scanning itself or to the printing and/or extraction of the data from each entry.

Beyond that, you can find out how long each iteration takes with:

while ((entry = sevenZFile.getNextEntry()) != null) {
    long s2 = System.currentTimeMillis();
    System.out.println("entry\t" + (s2-s) + " ms.");
    s = s2;
}

I have a vague recollection that Apache Commons Compress reads the entire list of entries up front, and that appears to be the case based on the source code.

One possibility would be to grab that source code, incorporate it as-is into your own code temporarily, and then profile it to see where it spends most of its time during instantiation.
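To attribute the cost, it can also help to time the constructor and the getNextEntry() loop as two separate phases. A minimal sketch with a small stopwatch helper; the `Thread.sleep` calls are stand-ins for the real work (running the real thing needs Commons Compress and an actual archive), with the real calls shown in comments:

```java
public class PhaseTiming {
    // Runs the task and returns the elapsed wall-clock time in milliseconds.
    static long timeMs(Runnable task) {
        long start = System.currentTimeMillis();
        task.run();
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        long open = timeMs(() -> {
            // real code: sevenZFile = new SevenZFile(new File("testfile.7z"));
            try { Thread.sleep(50); } catch (InterruptedException ignored) {}
        });
        long scan = timeMs(() -> {
            // real code: while (sevenZFile.getNextEntry() != null) {}
            try { Thread.sleep(10); } catch (InterruptedException ignored) {}
        });
        System.out.println("open\t" + open + " ms");
        System.out.println("scan\t" + scan + " ms");
    }
}
```

If "open" dominates, the time is going into parsing the archive headers at construction; if "scan" dominates, it's whatever getNextEntry() does per entry.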

paxdiablo
  • It takes the same amount of time. – Vektor88 Oct 09 '14 at 08:12
  • I changed the code to see how long it takes to get to each single file, and the problem is with `getNextEntry` after the largest file (16 MB uncompressed). The list of entries is generated when the `SevenZFile` object is created, but it seems that `getNextEntry` does something to "prepare" the file contents to be read, and there's no option to disable this. This is probably the problem. – Vektor88 Oct 09 '14 at 08:36