
PROBLEM SOLVED IN EDIT 3

I've been struggling with this problem for some time. All of the questions here on SO, or elsewhere on the internet, seem to work only on 'shallow' structures with one zip inside another. However, I have a zip archive whose structure looks more or less like this:

input.zip/
--1.zip/
--folder/
----2.zip/
------3.zip/
--------test/
----------some-other-folder/
----------archive.gz/
------------filte-to-parse
----------file-to-parse3.txt
------file-to-parse.txt
--4.zip/
------folder/

And so on. My code needs to handle N levels of nested zips while preserving the original structure of zips, gzips, folders and files. Using temporary files is forbidden because of a lack of privileges (this is something I'm not willing to change).

This is the code I've written so far. However, ZipOutputStream seems to operate on only one (top) level: when directories contain files or directories with exactly the same name, it throws Exception in thread "main" java.util.zip.ZipException: duplicate entry: folder/. It also skips empty directories, which is not what I expect. What I want to achieve is to somehow move my ZipOutputStream to a 'lower' level and operate on each of the zips. Maybe there's a better approach to the whole problem; any help would be appreciated. I need to perform certain text extraction/modification later, but I'm not starting on that until reading and writing the whole structure works properly. Thanks in advance for any help!

    //constructor
private final File zipFile;

ArchiveResolver(String fileToHandle) {
    this.zipFile = new File(Objects.requireNonNull(getClass().getClassLoader().getResource(fileToHandle)).getFile());
}

void resolveInputFile() throws Exception {
    FileInputStream fileInputStream = new FileInputStream(this.zipFile);
    FileOutputStream fileOutputStream = new FileOutputStream("out.zip");
    ZipOutputStream zipOutputStream = new ZipOutputStream(fileOutputStream);
    ZipInputStream zipInputStream = new ZipInputStream(fileInputStream);

    zip(zipInputStream, zipOutputStream);

    zipInputStream.close();
    zipOutputStream.close();
}

//    this one doesn't preserve internal structure(empty folders), but can work on each file
private void zip(ZipInputStream zipInputStream, ZipOutputStream zipOutputStream) throws IOException {
    ZipEntry entry;
    while ((entry = zipInputStream.getNextEntry()) != null) {
        System.out.println(entry.getName());
        byte[] buffer = new byte[1024];
        int length;
        if (entry.getName().endsWith(".zip")) {
//              wrapping outer zip streams to inner streams making actual entries a new source
            ZipInputStream innerZipInputStream = new ZipInputStream(zipInputStream);
            ZipOutputStream innerZipOutputStream = new ZipOutputStream(zipOutputStream);

            ZipEntry zipEntry = new ZipEntry(entry.getName());
//              add new zip entry here to outer zipOutputStream: i.e. data.zip
            zipOutputStream.putNextEntry(zipEntry);

//              now treat this data.zip as parent and call recursively zipFolder on it
            zip(innerZipInputStream, innerZipOutputStream);

//              Finish internal stream work when innerZipOutput is done
            innerZipOutputStream.finish();

//              Close entry
            zipOutputStream.closeEntry();
        } else if (entry.isDirectory()) {
//              putting new zip entry into output stream and adding extra '/' to make
//              sure zipOutputStream will treat it as folder
            ZipEntry zipEntry = new ZipEntry(entry.getName() + "/");

//              this only should preserve internal structure
            zipOutputStream.putNextEntry(zipEntry);

//              reading everything from zipInputStream
            while ((length = zipInputStream.read(buffer)) > 0) {
//                  sending it straight to zipOutputStream
                zipOutputStream.write(buffer, 0, length);
            }

            zipOutputStream.closeEntry();

//              This else will include checking if file is respectively:
//              .gz file <- then open it, read from file inside, modify and save it
//              .txt file <- also read, modify and preserve
        } else {
//              create new entry on top of this
            ZipEntry zipEntry = new ZipEntry(entry.getName());
            zipOutputStream.putNextEntry(zipEntry);
            while ((length = zipInputStream.read(buffer)) > 0) {
                zipOutputStream.write(buffer, 0, length);
            }
            zipOutputStream.closeEntry();
        }
    }
}

//    This one preserves internal structure (empty folders and so)
//    BUT! no work on each file is possible it just preserves everything as it is
private void zipWhole(ZipInputStream zipInputStream, ZipOutputStream zipOutputStream) throws IOException {
    ZipEntry entry;
    while ((entry = zipInputStream.getNextEntry()) != null) {
        System.out.println(entry.getName());
        byte[] buffer = new byte[1024];
        int length;
        zipOutputStream.putNextEntry(new ZipEntry(entry.getName()));
        while ((length = zipInputStream.read(buffer)) > 0) {
            zipOutputStream.write(buffer, 0, length);
        }
        zipOutputStream.closeEntry();
    }
}

EDIT:

I've updated my code to the newest version. It's still nothing to be proud of, and it's still not working. I've added two very important comments about the code that (in my opinion) fails. I've tested two approaches. The first gets a ZipInputStream from zipFile using getInputStream(ZipEntry e); it throws Exception in thread "main" java.util.zip.ZipException: no current ZIP entry when I try to put entries into the ZipOutputStream. The second approach "wraps" one ZipInputStream in another; this results in empty ZipInputStreams with no entries, and the application just goes through the files, lists them (only the top level of zips...) and finishes without saving anything into the out.zip file.

EDIT 2:

Following a few suggestions from people in the comments, I've decided to rewrite my code, focusing on calling close, finish and closeEntry in the appropriate places (I hope I did it better this time). So far I've made some progress: the code iterates through every entry and saves it into the out.zip file with the proper zip packaging inside. It still skips empty folders though, and I'm not sure why (I've checked some of the questions on Stack Overflow and the web, and it seems OK). Anyway, thanks for the help so far; I'll try to work this out and keep this updated.

EDIT 3:

After a few approaches to the problem and some reading and refactoring, I've managed to solve it (however, there's still a problem when running this code on Linux: empty directories are skipped, which seems to be connected to the way certain OSes preserve file information?). Here's the working solution:

void resolveInputFile() throws IOException {
    FileInputStream fileInputStream = new FileInputStream(this.zipFile);
    FileOutputStream fileOutputStream = new FileOutputStream("in.zip");
    ZipOutputStream zipOutputStream = new ZipOutputStream(fileOutputStream);
    ZipInputStream zipInputStream = new ZipInputStream(fileInputStream);

    zip(zipInputStream, zipOutputStream);

    zipInputStream.close();
    zipOutputStream.close();
}

private void zip(ZipInputStream zipInputStream, ZipOutputStream zipOutputStream) throws IOException {
    ZipEntry entry;
    while ((entry = zipInputStream.getNextEntry()) != null) {
        logger.info(entry.getName());

        if (entry.getName().endsWith(".zip")) {
            // If entry is zip, I create inner zip streams that wrap outer ones
            ZipInputStream innerZipInputStream = new ZipInputStream(zipInputStream);
            ZipOutputStream innerZipOutputStream = new ZipOutputStream(zipOutputStream);

            ZipEntry zipEntry = new ZipEntry(entry.getName());
            zipOutputStream.putNextEntry(zipEntry);

            zip(innerZipInputStream, innerZipOutputStream);
            // As mentioned in the comments, streams need to be finished/closed properly: I'm done writing to the inner stream, so I call finish() rather than close(), which would also close the outer stream
            innerZipOutputStream.finish();
            zipOutputStream.closeEntry();

        } else if (entry.getName().endsWith(".gz")) {

            GZIPInputStream gzipInputStream = new GZIPInputStream(zipInputStream);
            // Small trap when using GZIP: to save it properly, I had to put the new ZipEntry into the outer zipOutputStream BEFORE creating the GZIPOutputStream wrapper
            ZipEntry zipEntry = new ZipEntry(entry.getName());
            zipOutputStream.putNextEntry(zipEntry);
            GZIPOutputStream gzipOutputStream = new GZIPOutputStream(zipOutputStream);
            // To make it as efficient as possible, I've used a BufferedReader
            BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(gzipInputStream));

            long start = System.nanoTime();
            logger.info("Started to process {}", zipEntry.getName());

            String line;
            while ((line = bufferedReader.readLine()) != null) {

                //PROCESSING LINE BY LINE...

                // Write through gzipOutputStream so the entry really contains gzip-compressed data
                gzipOutputStream.write((line + "\n").getBytes());
            }

            logger.info("Processing of {} took {} milliseconds", entry.getName(), (System.nanoTime() - start) / 1_000_000);
            gzipOutputStream.finish();
            zipOutputStream.closeEntry();

        } else if (entry.getName().endsWith(".txt")) {

            ZipEntry zipEntry = new ZipEntry(entry.getName());
            zipOutputStream.putNextEntry(zipEntry);
            BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(zipInputStream));

            long start = System.nanoTime();
            logger.info("Started to process {}", zipEntry.getName());

            String line;
            while ((line = bufferedReader.readLine()) != null) {

                //PROCESSING LINE BY LINE...

                zipOutputStream.write((line + "\n").getBytes());
            }

            logger.info("Processing of {} took {} milliseconds", entry.getName(), (System.nanoTime() - start) / 1_000_000);
            zipOutputStream.closeEntry();

        } else if (entry.isDirectory()) {
            //Standard directory preserving
            byte[] buffer = new byte[8192];
            int length;
            // entry.getName() already ends with "/" whenever isDirectory() is true, so reuse it unchanged
            ZipEntry zipEntry = new ZipEntry(entry.getName());
            zipOutputStream.putNextEntry(zipEntry);
            while ((length = zipInputStream.read(buffer)) > 0) {
                // sending it straight to zipOutputStream
                zipOutputStream.write(buffer, 0, length);
            }

            zipOutputStream.closeEntry();
        } else {
            //In my case it probably will never be called but if there's some different file in here it will be preserved unchanged in the output file
            byte[] buffer = new byte[8192];
            int length;
            ZipEntry zipEntry = new ZipEntry(entry.getName());
            zipOutputStream.putNextEntry(zipEntry);
            while ((length = zipInputStream.read(buffer)) > 0) {
                zipOutputStream.write(buffer, 0, length);
            }
            zipOutputStream.closeEntry();
        }
    }
}
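For the .gz branch, the order of operations matters: put the zip entry first, then create the GZIPOutputStream, write the lines through it, and call finish() before closeEntry(). Below is a minimal in-memory sketch of that order. The class name `GzipInsideZip`, its helper methods, and the upper-casing used as stand-in "processing" are made up for illustration, and it assumes Java 9+ for `transferTo` and `readAllBytes`.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class GzipInsideZip {

    // Hypothetical helper: builds a one-entry zip whose entry holds gzip-compressed text.
    static byte[] zipWithGz(String entryName, String text) {
        try {
            ByteArrayOutputStream gzBytes = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(gzBytes)) {
                gz.write(text.getBytes(StandardCharsets.UTF_8));
            }
            ByteArrayOutputStream zipBytes = new ByteArrayOutputStream();
            try (ZipOutputStream zip = new ZipOutputStream(zipBytes)) {
                zip.putNextEntry(new ZipEntry(entryName));
                zip.write(gzBytes.toByteArray());
                zip.closeEntry();
            }
            return zipBytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Rewrites every *.gz entry line by line; upper-casing stands in for real processing.
    static byte[] process(byte[] inputZip) {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (ZipInputStream zipIn = new ZipInputStream(new ByteArrayInputStream(inputZip));
             ZipOutputStream zipOut = new ZipOutputStream(result)) {
            ZipEntry entry;
            while ((entry = zipIn.getNextEntry()) != null) {
                // Put the entry first: the GZIPOutputStream constructor writes the gzip header immediately.
                zipOut.putNextEntry(new ZipEntry(entry.getName()));
                if (entry.getName().endsWith(".gz")) {
                    BufferedReader reader = new BufferedReader(
                            new InputStreamReader(new GZIPInputStream(zipIn), StandardCharsets.UTF_8));
                    GZIPOutputStream gzOut = new GZIPOutputStream(zipOut);
                    String line;
                    while ((line = reader.readLine()) != null) {
                        // Write to gzOut, not zipOut, so the data is actually gzip-compressed.
                        gzOut.write((line.toUpperCase(Locale.ROOT) + "\n").getBytes(StandardCharsets.UTF_8));
                    }
                    gzOut.finish(); // writes the gzip trailer without closing zipOut
                } else {
                    zipIn.transferTo(zipOut); // copy other entries unchanged
                }
                zipOut.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result.toByteArray();
    }

    // Hypothetical helper: returns the decompressed text of one gzip entry, or null if absent.
    static String readGzEntry(byte[] zipBytes, String entryName) {
        try (ZipInputStream zipIn = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zipIn.getNextEntry()) != null) {
                if (entry.getName().equals(entryName)) {
                    return new String(new GZIPInputStream(zipIn).readAllBytes(), StandardCharsets.UTF_8);
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] output = process(zipWithGz("archive.gz", "a\nb\n"));
        System.out.print(readGzEntry(output, "archive.gz"));
    }
}
```

The reader and the gzip streams are deliberately left unclosed, because closing any of them would also close the zip stream they wrap.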

Thanks again for all the help and good advice.

Macryo
  • You need to `innerZipOutputStream.finish()` *before* you `zipOutputStream.closeEntry()` – Andreas Nov 03 '19 at 15:47
  • You need to wrap `zipOutputStream` in a `GZIPOutputStream`, unless you intended to decompress the content without removing the `.gz` extension. – Andreas Nov 03 '19 at 15:49
  • *FYI:* Since `read(...)` can never read more bytes than will fit in the buffer, `len` can never be more than `buf.length`, so those calls to `Math.min(...)` are redundant. – Andreas Nov 03 '19 at 15:51
  • Thanks for the tips Andreas, I did some fixes to the code and it makes more sense now, however I'm still struggling with "wrapping" the input stream and treating it as a new source - reading anything from innerZIS just gives null. Here are some changes https://pastebin.com/ZkkVd19T (will paste the solution into the question when resolved) – Macryo Nov 04 '19 at 09:30
  • Don't link to your code. Edit the question and update the code you have. – Andreas Nov 04 '19 at 11:25
  • Yeah, silly me, fixed, question code is updated. – Macryo Nov 04 '19 at 20:50
  • Only call `finish()` when you are *done* writing to the `ZipOutputStream`. Don't call it before closing each entry. The sequence for writing a zip file is: 1) Open stream, 2) `putNextEntry(...)` 3) `write(...)` 4) `closeEntry()` *(optional)* 5) Repeat 2-4 as many times as needed. 6) `finish()` *(optional)* 7) `close()`. --- Step 4 is optional since both `putNextEntry(...)` and `finish()` will close the previous entry for you. Step 6 is optional since `close()` will finish the stream for you, but is required if you don't close the zip stream, as you cannot do when writing nested zip streams. – Andreas Nov 05 '19 at 03:48
  • Thanks for all the help @Andreas I've decided to rewrite this code once again keeping strictly to the rules everyone here mentioned and see what happens. I hope it will get me further. – Macryo Nov 05 '19 at 08:50
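
The write sequence described in the comments above can be sketched for one level of nesting as follows. This is only an illustration working on in-memory byte arrays; the class name `NestedZipWriteOrder`, the entry names, and the helper methods are assumptions made for the example, and `readAllBytes` assumes Java 9+.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class NestedZipWriteOrder {

    // Builds an outer zip that contains inner.zip (holding hello.txt) and a sibling after.txt.
    static byte[] buildNestedZip() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ZipOutputStream outer = new ZipOutputStream(bytes);      // 1) open stream

            outer.putNextEntry(new ZipEntry("inner.zip"));           // 2) entry that will hold the nested zip
            ZipOutputStream inner = new ZipOutputStream(outer);      //    nested stream writes into that entry
            inner.putNextEntry(new ZipEntry("hello.txt"));
            inner.write("hello".getBytes(StandardCharsets.UTF_8));   // 3) write
            inner.closeEntry();                                      // 4) optional
            inner.finish();     // 6) finish() instead of close(): close() would also close 'outer'
            outer.closeEntry(); //    only now close the entry in the outer zip

            outer.putNextEntry(new ZipEntry("after.txt"));           // 5) repeat 2-4; 'outer' is still usable
            outer.write("still writable".getBytes(StandardCharsets.UTF_8));
            outer.closeEntry();

            outer.close();                                           // 7) close the outermost stream
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Lists the entry names of the outer zip only.
    static List<String> topLevelNames(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream outer = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = outer.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    // Reads hello.txt back out of inner.zip by wrapping the outer stream; returns null if absent.
    static String readInnerText(byte[] zipBytes) {
        try (ZipInputStream outer = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = outer.getNextEntry()) != null) {
                if (entry.getName().equals("inner.zip")) {
                    ZipInputStream inner = new ZipInputStream(outer); // not closed: that would close 'outer'
                    ZipEntry innerEntry = inner.getNextEntry();
                    if (innerEntry != null && innerEntry.getName().equals("hello.txt")) {
                        return new String(inner.readAllBytes(), StandardCharsets.UTF_8);
                    }
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] zip = buildNestedZip();
        System.out.println(topLevelNames(zip));
        System.out.println(readInnerText(zip));
    }
}
```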

1 Answer


There seems to be a lot of debugging and refactoring to be done there.

There's an obvious problem that you are either not closing your streams/entries or doing so in the wrong order. Buffered data will get lost and the central directory will not be written. (There is a complication: Java streams unhelpfully close the stream they wrap, hence finish vs. close, but it still needs to be done in the correct order.)

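Zip files have a flat structure with no real directory tree: the entire file path is included for each entry in both the local header and the central directory. A directory exists at most as a separate, zero-length entry whose name ends in "/", and nothing requires such an entry to be present for a file path to be valid.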
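
To illustrate the flat layout, here is a small sketch (class and method names are hypothetical) that writes one explicit empty-directory entry and one file with a nested path, then lists what the archive actually contains. Note that no entries for "folder/" or "folder/sub/" are created implicitly.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class DirectoryEntries {

    // Writes an explicit empty directory entry plus a file whose parent folders have no entries.
    static byte[] buildZip() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("empty-folder/")); // trailing "/" marks a directory entry
            zip.closeEntry();                                // no data is written for it

            zip.putNextEntry(new ZipEntry("folder/sub/file.txt"));
            zip.write("content".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Maps each entry name to ZipEntry.isDirectory(), in archive order.
    static Map<String, Boolean> readEntries(byte[] zipBytes) {
        Map<String, Boolean> entries = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                entries.put(entry.getName(), entry.isDirectory());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entries;
    }

    public static void main(String[] args) {
        System.out.println(readEntries(buildZip()));
    }
}
```

Because `isDirectory()` is defined purely by the trailing "/", an entry name read from an existing archive can be reused as-is when copying a directory entry.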

The part of the Java zip library that gives a random-access interface (`ZipFile`) uses memory-mapped files, so you are stuck with streams for everything except, perhaps, the top level.
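
A stream-only approach can still recurse to any depth without temporary files. The following is a rough sketch rather than a drop-in solution: the class `NestedZipCopy`, the sample archive layout, and the "!/" path separator in the listing are all assumptions for illustration, and it detects nested archives purely by the ".zip" suffix (Java 9+ for `transferTo`).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class NestedZipCopy {

    // Copies every entry; entries named *.zip are opened and copied recursively.
    static void copy(ZipInputStream in, ZipOutputStream out) throws IOException {
        ZipEntry entry;
        while ((entry = in.getNextEntry()) != null) {
            out.putNextEntry(new ZipEntry(entry.getName()));
            if (entry.getName().endsWith(".zip")) {
                ZipOutputStream innerOut = new ZipOutputStream(out);
                copy(new ZipInputStream(in), innerOut);
                innerOut.finish();      // writes the inner central directory, leaves 'out' open
            } else if (!entry.isDirectory()) {
                in.transferTo(out);     // plain file: copy the bytes unchanged
            }
            out.closeEntry();
        }
    }

    // Collects entry paths, descending into nested zips, e.g. "1.zip!/folder/2.zip".
    static void list(ZipInputStream in, String prefix, List<String> names) throws IOException {
        ZipEntry entry;
        while ((entry = in.getNextEntry()) != null) {
            String path = prefix + entry.getName();
            names.add(path);
            if (entry.getName().endsWith(".zip")) {
                list(new ZipInputStream(in), path + "!/", names);
            }
        }
    }

    // Hypothetical sample: 1.zip containing folder/ and folder/2.zip, which contains file-to-parse.txt.
    static byte[] sample() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ZipOutputStream level0 = new ZipOutputStream(bytes);
            level0.putNextEntry(new ZipEntry("1.zip"));
            ZipOutputStream level1 = new ZipOutputStream(level0);
            level1.putNextEntry(new ZipEntry("folder/"));
            level1.closeEntry();
            level1.putNextEntry(new ZipEntry("folder/2.zip"));
            ZipOutputStream level2 = new ZipOutputStream(level1);
            level2.putNextEntry(new ZipEntry("file-to-parse.txt"));
            level2.write("text".getBytes(StandardCharsets.UTF_8));
            level2.closeEntry();
            level2.finish();
            level1.closeEntry();
            level1.finish();
            level0.closeEntry();
            level0.close();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static byte[] copy(byte[] zipBytes) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes));
             ZipOutputStream out = new ZipOutputStream(bytes)) {
            copy(in, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static List<String> list(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            list(in, "", names);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(list(copy(sample())));
    }
}
```

One trade-off of this design, as far as I know: the inner streams are never closed, so their compressor and decompressor resources are released only when they are garbage-collected rather than deterministically.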

Tom Hawtin - tackline
  • Yeah I'm aware that I'm doing a lot wrong way... I was betting that it might be this invalid stream closing but when I was focusing mostly on closing each entry I was swarmed with `invalid entry size (expected 107 but got 105 bytes)` etc. I'll give some reading to that finish/close in streams tho maybe I'll find something. So far I know about them is `finish` closes certain inner entry compressed stream while `close` ... well closes stream. Thanks for tips however – Macryo Nov 03 '19 at 14:38
  • @MacRyze Those errors may be important. Not sure where they are coming from. You seem to be creating a new `ZipEntry` each time - if you were using the same one, then I guess the size may be wrong. (I also notice that there is some kind of bug dealing with zip with uncompressed data introduced in JDK13 but backported - I don't know the details.) – Tom Hawtin - tackline Nov 04 '19 at 13:42
  • thanks for the comment, well across this whole implementation I've seen errors with even huge entry size difference like `expected 2512 but got 0 bytes` - it all depends on the case, guess this is still my fault after all. Btw. I'm using OpenJDK 11.0.4 with JVM parameter -Xmx8M (I've tried changing available memory, not the case tho). – Macryo Nov 04 '19 at 20:49