
I am processing multiple .json files that I need to add to a single .zip archive, using the package available here: https://github.com/larzconwell/bzip2.

I have already looked through other possible solutions and questions related to io.Writer, along with .Close() and .Flush().

Here is the code I am using:

        if processedCounter%*filesInPackage == 0 || filesLeftToProcess == 0 {

            // Create empty zip file with numbered filename.
            emptyZip, err := os.Create(filepath.Join(absolutePathOutputDirectory, "package_"+strconv.Itoa(packageCounter)+".zip"))
            if err != nil {
                panic(err)
            }

            // Get list of .json filenames to be packaged:
            listOfProcessedJSON := listFiles(absolutePathInterDirectory, ".json")

            bzipWriter, err := bzip2.NewWriterLevel(emptyZip, 1)
            if err != nil {
                panic(err)
            }
            defer bzipWriter.Close()

            // Add listed files to the archive
            for _, file := range listOfProcessedJSON {
                // Read byte array from json file:
                JSONContents, err := ioutil.ReadFile(file)
                if err != nil {
                    fmt.Printf("Failed to open %s: %s", file, err)
                }

                // Write a single JSON to .zip:
                // Process hangs here!
                _, compressionError := bzipWriter.Write(JSONContents)
                if compressionError != nil {
                    fmt.Printf("Failed to write %s to zip: %s", file, err)
                    compressionErrorCounter++
                }

                err = bzipWriter.Close()
                if err != nil {
                    fmt.Printf("Failed to Close bzipWriter")
                }
            }

            // Delete intermediate .json files
            dir, err := ioutil.ReadDir(absolutePathInterDirectory)
            for _, d := range dir {
                os.RemoveAll(filepath.Join([]string{"tmp", d.Name()}...))
            }

            packageCounter++
        }

Using a debugger, it seems that my program hangs on the following line:

    _, compressionError := bzipWriter.Write(JSONContents)

The package itself does not provide usage examples, so my knowledge is based on studying the documentation, Stack Overflow questions, and various articles, e.g.:

https://www.golangprograms.com/go-program-to-compress-list-of-files-into-zip.html

Let me know if anyone knows a possible solution to this problem.

  • Bzip2 is a compression algorithm, not an archive. Are you certain you want a zip file with bzip2 as the compressor, when most things expect deflate? – JimB Jan 06 '21 at 01:29
  • @JimB This is a valid point; right now it is set to `.zip`, but I might change it to `.bz2`. Although this still does not solve the problem that the program hangs and does not create an archive with the bzip2 compression algorithm. The program was tested on a sample of 9 `.json` files. – Kaszanas Jan 06 '21 at 01:33
  • Bzip2 itself is not an archive, so your code cannot create such a thing. You are not handling all errors, so any of those errors could cause problems. You are also trying to close the writer after every write, which I don’t understand. – JimB Jan 06 '21 at 02:55
  • @Kaszanas JimB _is_ providing constructive ideas. He's pointing out the logical inconsistencies in your request, which make your request, as worded, literally impossible to fulfill. – Jonathan Hall Jan 06 '21 at 11:39
  • Your code appears to be concatenating the contents of a series of JSON files into a single bzip2-compressed file. In shell it is the equivalent of "cat *.json | bzip2 > file.zip", so although the output file happens to have a ".zip" extension it isn't actually a zip archive. Is that the intention? If you need the JSON files to be accessed independently you probably want to use https://golang.org/pkg/archive/zip/ – pmqs Jan 06 '21 at 14:40
  • @pmqs The intention is to create a "packaged" archive which contains compressed data coming from multiple `.json` files, which can be accessed after decompression. In this context the extension itself is not important. The goal is to achieve the highest compression possible. – Kaszanas Jan 06 '21 at 21:17

1 Answer


You are confusing the formats and what they do, likely because they contain the common substring "zip". zip is an archive format, intended to contain multiple files. bzip2 is a single-stream compressor, not an archive format, and can store only one file; gzip is the same in that regard. gzip, bzip2, xz, and other single-file compressors are all commonly used with tar in order to archive multiple files and their directory structure: tar collects the multiple files and structure into a single, uncompressed file, which is then compressed by the compressor of your choice.

The zip format works differently, where the archive format is on the outside, and each entry in the archive is individually compressed.

In any case, a bzip2 package by itself cannot archive multiple files.
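
For illustration, here is a minimal sketch of the tar route, using the standard library's compress/gzip as the compressor; the bzip2 writer from the question could be swapped in at the same point, assuming it behaves as its API suggests. The output path and file list are placeholders:

    package main

    import (
        "archive/tar"
        "compress/gzip"
        "io"
        "os"
        "path/filepath"
    )

    // tarGzFiles writes the listed files into a single .tar.gz stream:
    // tar supplies the multi-file archive structure, gzip supplies the compression.
    func tarGzFiles(outPath string, files []string) error {
        out, err := os.Create(outPath)
        if err != nil {
            return err
        }
        defer out.Close()

        gz, err := gzip.NewWriterLevel(out, gzip.BestCompression)
        if err != nil {
            return err
        }
        defer gz.Close() // flushes the compressed stream

        tw := tar.NewWriter(gz)
        defer tw.Close() // writes the tar trailer; runs before gz.Close()

        for _, file := range files {
            if err := addFile(tw, file); err != nil {
                return err
            }
        }
        return nil
    }

    // addFile copies one file into the tar archive with a minimal header.
    func addFile(tw *tar.Writer, file string) error {
        f, err := os.Open(file)
        if err != nil {
            return err
        }
        defer f.Close()

        info, err := f.Stat()
        if err != nil {
            return err
        }
        hdr, err := tar.FileInfoHeader(info, "")
        if err != nil {
            return err
        }
        hdr.Name = filepath.Base(file)
        if err := tw.WriteHeader(hdr); err != nil {
            return err
        }
        _, err = io.Copy(tw, f)
        return err
    }

    func main() {
        // Placeholder file list; in the question it comes from listFiles(...).
        if err := tarGzFiles("package_0.tar.gz", []string{"a.json", "b.json"}); err != nil {
            panic(err)
        }
    }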

Mark Adler
  • What you are describing is indeed what I want. In that context I could individually compress every `.json` file by itself and add them to a compound "packaged" `.zip` archive. The goal is to achieve the highest compression possible, which, when tested with 7Zip, was indeed the "bzip2" compression algorithm. – Kaszanas Jan 06 '21 at 21:19
  • Then you need another library that will take the bz2 files generated and put them in a zip file. Whoever receives this will have to decompress the zip file to extract the bz2 files, and _then_ decompress each of the bz2 files individually. Is that what you intend? – Mark Adler Jan 06 '21 at 21:50
  • The original intention is to extract the archive and have the `.json` files ready to go, but using the best available compression algorithm with a high compression level. I have 150 GB of data which will grow 40x when parsed into JSON and needs to be compressed as much as possible. It is strange to hear that bzip2 cannot handle multiple files, as it seems that 7Zip can add multiple files into an archive using the bzip2 compression algorithm. – Kaszanas Jan 06 '21 at 23:28
  • As can PKZip. bzip2 is one of many _compression_ algorithms supported by the zip format. However, bzip2 is not an _archiving_ algorithm. You cannot use the package you found to make an archive. Period. – Mark Adler Jan 07 '21 at 01:24
  • Have you tried 7Zip on your data using LZMA2? I suspect it would do better than bzip2, if you are looking for the best compression. – Mark Adler Jan 07 '21 at 01:26
  • Okay, it seems a little clearer now what I should look into. I guess I need to use this package to create an archive: https://golang.org/pkg/archive/zip/ and then add the bzip2-compressed files into that archive one by one in a loop? I do not understand the Go documentation, as both archive/zip and bzip2 expose a `NewWriter`, and their respective purposes are not clear to me. – Kaszanas Jan 07 '21 at 03:45
  • @Kaszanas You need to update your question or ask a new question with how to compress bzip2 entries with the golang zip package. That package only natively supports stored and deflated entries. You will need to hook up a bzip2 compressor using `RegisterCompressor`. – Mark Adler Jan 07 '21 at 06:56
  • I have prepared a minimal working example using `archive/zip` and `RegisterCompressor`, but it only works with the out-of-the-box deflate compression algorithm (`zip.FileHeader{Method: 8}`). The question will soon be edited; a rough sketch of what I am attempting follows below. – Kaszanas Jan 15 '21 at 06:11
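
A minimal sketch of the `RegisterCompressor` hookup described above, assuming the github.com/larzconwell/bzip2 writer behaves as used in the question (`NewWriterLevel` returning a writer that also satisfies io.WriteCloser); the input filenames are placeholders:

    package main

    import (
        "archive/zip"
        "io"
        "os"

        "github.com/larzconwell/bzip2" // writer API assumed to match the question
    )

    // The ZIP specification assigns compression method 12 to bzip2.
    const zipMethodBzip2 = 12

    func main() {
        out, err := os.Create("package.zip")
        if err != nil {
            panic(err)
        }
        defer out.Close()

        zw := zip.NewWriter(out)
        defer zw.Close()

        // Hook the third-party bzip2 writer into archive/zip for method 12.
        // NewWriterLevel(w, level) is the constructor used in the question; its
        // return value is assumed to implement io.WriteCloser.
        zw.RegisterCompressor(zipMethodBzip2, func(w io.Writer) (io.WriteCloser, error) {
            return bzip2.NewWriterLevel(w, 9)
        })

        // Placeholder input files; in the question they come from listFiles(...).
        for _, name := range []string{"a.json", "b.json"} {
            f, err := os.Open(name)
            if err != nil {
                panic(err)
            }

            // Request the registered bzip2 method for this entry instead of deflate (method 8).
            entry, err := zw.CreateHeader(&zip.FileHeader{Name: name, Method: zipMethodBzip2})
            if err != nil {
                f.Close()
                panic(err)
            }
            if _, err := io.Copy(entry, f); err != nil {
                f.Close()
                panic(err)
            }
            f.Close()
        }
    }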