
I am trying to gzip a large (100 MB to 500 MB) XML file, and I created the zip method below to do that. The issue is that it takes too much time: for 200 MB it takes 1.2 seconds, and I need to reduce the time to 100 milliseconds for a 100 MB XML file. How do I optimize this to reduce the zipping time?

I have already reduced the time by compromising a little on compression ratio. I also tried other algorithms like Snappy and LZ4, but they gave little improvement and have poorer compression. As far as I can tell, gzipOutputStream.write() takes 85% of the time. So how can I optimize this step to get better performance without compromising much of the compression ratio?

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.GZIPOutputStream;
import javax.xml.bind.DatatypeConverter;

public static String zip(final String str) {
    if ((str == null) || (str.length() == 0)) {
        throw new IllegalArgumentException("Cannot zip null or empty string");
    }

    try (ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream(str.length())) {
        // Anonymous subclass: the instance initializer sets the protected Deflater
        // field (inherited from DeflaterOutputStream) to BEST_SPEED.
        try (GZIPOutputStream gzipOutputStream =
                 new GZIPOutputStream(byteArrayOutputStream) {{ def.setLevel(Deflater.BEST_SPEED); }}) {
            gzipOutputStream.write(str.getBytes(StandardCharsets.UTF_8));
        }

        byte[] bytes = byteArrayOutputStream.toByteArray();
        // Base64-encode the compressed bytes so the result can be stored as text (CLOB).
        return DatatypeConverter.printBase64Binary(bytes);

    } catch (IOException e) {
        throw new RuntimeException("Failed to zip content", e);
    }
}
  • Remove compression, measure time - this will probably be your time asymptote. – Antoniossss May 06 '19 at 12:04
  • You're essentially asking how to make a piece of code which is in no way optimised for speed to be 12 times faster. The answer is: use a compression algorithm with an implementation optimised for speed. And then your hardware may still be a bottleneck. – Gimby May 06 '19 at 12:10
  • In each step you are always processing the whole block. 100 MB is too large to fit in any CPU cache. Therefore process the data in blocks of ~500 KB and directly redirect the output using streams, e.g. use Base64OutputStream from Apache Commons Codec. – Robert May 06 '19 at 12:18
  • @Robert I am new to Java. Can you provide the optimized code or an example for this? Thank you for your time – SHUHAIB AREEKKAN May 07 '19 at 07:53
  • Don't try to do this all in memory. Write it to the target file, or socket, or whatever, as you go. – user207421 May 07 '19 at 10:36

1 Answer

Here is my advice:

  1. Create a proper benchmark so that you can get repeatable results. I would advise using a benchmarking framework; e.g. JMH.

  2. Profile your code / benchmark to identify where the bottlenecks / hotspots are; e.g. using jVisualVM or Java Mission Control Flight Recorder.

  3. Use the benchmarks and profiling results to guide your optimization effort.

(I would NOT rely simply on calls to System.currentTimeMillis() for a variety of reasons.)
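For point 1, here is a minimal JMH sketch of what such a benchmark could look like. The Zipper.zip(String) call and the input file path are assumptions standing in for the question's code, not part of JMH itself:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
@Fork(1)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
public class ZipBenchmark {

    private String xml;

    @Setup
    public void loadXml() throws Exception {
        // Hypothetical path; point this at a representative 100 MB+ XML file.
        xml = new String(Files.readAllBytes(Paths.get("/tmp/sample.xml")),
                         StandardCharsets.UTF_8);
    }

    @Benchmark
    public String zipXml() {
        return Zipper.zip(xml);  // the zip(String) method from the question
    }
}

JMH takes care of warmup and repetition for you, which is precisely what raw System.currentTimeMillis() calls do not.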

One possible explanation is that a significant percentage of the time is spent on data copying in the following steps.

  • Creating the input string containing the XML
  • Capturing the compressed bytes in a ByteArrayOutputStream
  • Converting the bytes into another String.

So if you are looking for ways to improve this, try to arrange things so that the XML serializer writes to a pipeline that streams data through gzip and base64 conversion and then writes directly to a file or socket stream.
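As a rough illustration (a sketch only, assuming Java 8+; the output file name is hypothetical), the streams can be chained so that the whole document is never materialized as an intermediate String or byte[]:

import java.io.BufferedOutputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;
import java.util.zip.GZIPOutputStream;

public static void writeCompressedXml() throws Exception {
    // Data flows: writer -> gzip -> base64 -> buffered file stream.
    try (OutputStream file = new BufferedOutputStream(
             Files.newOutputStream(Paths.get("out.xml.gz.b64")));  // hypothetical target
         OutputStream b64 = Base64.getEncoder().wrap(file);        // keep only if base64 is really required
         OutputStream gzip = new GZIPOutputStream(b64);
         Writer writer = new OutputStreamWriter(gzip, StandardCharsets.UTF_8)) {
        // Hand 'writer' to the XML serializer so data streams through
        // compression and encoding one buffer at a time.
        writer.write("<root>...</root>");  // placeholder for the real serialization
    }
}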

Also, I would avoid using base64 if possible. If the compressed XML is in an HTTP response, you should be able to send it in binary. It will be faster, and generate significantly less network traffic.
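For instance (a sketch assuming a plain servlet environment; adapt to whatever web framework you actually use), the standard Content-Encoding header lets you gzip the response body with no base64 step at all:

import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;
import javax.servlet.http.HttpServletResponse;

public static void sendCompressedXml(HttpServletResponse response) throws Exception {
    response.setContentType("application/xml");
    response.setHeader("Content-Encoding", "gzip");  // the client decompresses transparently
    try (OutputStream out = new GZIPOutputStream(response.getOutputStream())) {
        // Stream the XML serializer's output straight into 'out' here.
        out.write("<root>...</root>".getBytes(StandardCharsets.UTF_8));  // placeholder
    }
}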

Finally, pick a compression algorithm that gives a good compromise between compression ratio and compression time.


How can I optimize this step to get better performance without compromising the compression ratio?

If you are trying to do that, your goals are probably wrong. (And why did you then Base64 encode the compressed file? That contradicts your goal!)


Updates to address your comments:

  1. You will (I think) get better performance by streaming than by turning your XML into a String and then calling getBytes() on it. For a start, the getBytes() call is making an unnecessary copy of the string content.

  2. The Wikipedia page on Lossless Compression links to a number of algorithms, many of which should have readily available Java implementations. In addition, it has links to a number of benchmarks. I haven't looked at the benchmark links, but I expect at least one will quantify the compression versus compute time trade-off for different algorithms.

  3. If you change the database table from CLOB to BLOB (see the JDBC sketch after this list):

    • you can dispense with the base64, saving ~25% storage space (base64 encodes every 3 bytes as 4 characters, so the encoded form is one third larger than the raw bytes)
    • you can dispense with the base64 encoding step, saving a few percent of CPU
    • you can then pick a faster (but less compact) algorithm, saving more time at the cost of some of the space that you saved by going to a BLOB.
  4. "I can't really change it its business requirement." - Really? If the database schema is a business requirement, then there is something really screwed up with your business. And on the flip-side, if the business is dictating the technology at that level, then they are also dictating the performance.

    There is no sound technical reason to store compressed data as CLOBs.

  5. As someone noted, the easiest way to get faster compression is to buy a faster computer. Or (my idea) a bank of computers so that you can compress multiple files in parallel.
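To illustrate point 3, here is a minimal JDBC sketch of writing the compressed bytes straight into a BLOB column. The DOCS table, its columns, and the already-gzipped byte[] parameter are all hypothetical:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public static void storeCompressed(Connection conn, long id, byte[] gzippedXml)
        throws SQLException {
    // Hypothetical schema: DOCS(ID NUMBER, PAYLOAD BLOB).
    // setBytes() stores the raw compressed bytes: no base64, ~25% less storage.
    try (PreparedStatement ps =
             conn.prepareStatement("UPDATE DOCS SET PAYLOAD = ? WHERE ID = ?")) {
        ps.setBytes(1, gzippedXml);
        ps.setLong(2, id);
        ps.executeUpdate();
    }
}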

Stephen C
  • Thanks for your answer. Base64 is used to store the String in the database as a CLOB. I can't really change it; it's a business requirement. I can compromise on 10% of the compression ratio for better performance. Because I am new to Java, kindly provide examples or links related to your answer; it will be helpful for me. Thanks for your time – SHUHAIB AREEKKAN May 07 '19 at 07:49
  • gzipOutputStream.write(str.getBytes(StandardCharsets.UTF_8)); is taking 85% of the program's time. How can I optimize this step? – SHUHAIB AREEKKAN May 07 '19 at 08:07
  • @SHUHAIB To speed up this step, buy a faster PC (with high single-thread performance). Compression is computation intensive and therefore takes some time. – Robert May 07 '19 at 09:50
  • While the answer in general is certainly profound, I see the (single, specific) advice to use JMH here with some scrutiny: It is mainly a *microbenchmarking* framework, and my gut feeling is that trying to use it for a benchmark that involves lots of I/O and large, complex third-party functionalities might not be the best idea to start with. A profiler run with jVisualVM or the Java Mission Control Flight Recorder should more easily and quickly bring the relevant insights here. – Marco13 May 07 '19 at 11:03
  • @Marco13 - It depends. I suspect that the OP is already testing with a benchmark. But the chances are that he is not fully up to speed with how to write a benchmark so that it gives reliable results. Using JMH removes any doubt. (And there is no technical reason not to use it ... that I am aware of.) – Stephen C May 07 '19 at 11:07
  • Also, note that when a benchmark involves I/O 1) there are warmup effects at the OS level; e.g. the behavior of the buffer cache on Linux, and 2) you need to run a few repetitions because of file system / network variability. What we are really trying to optimize here is the *real* time to compress the data *and* write it somewhere. This is not a case where "least CPU == best" is necessarily correct. – Stephen C May 07 '19 at 11:13
  • Finally, 1) I mentioned a profiler already, and 2) he may not have access to Java Mission Control Flight Recorder; e.g. if he has switched to OpenJDK to get around the new Oracle "commercial use" rules. – Stephen C May 07 '19 at 11:16
  • Sure: Making a benchmark is easy. Making a benchmark that tells the truth™ is hard or close to impossible. I'm not sure how one can handle the OS/HDD caching with JMH. As for your second comment: there's no point in optimizing the slowest method of 10 JMH runs when the goal is to process a single file *once* and the actual time is spent in *reading/writing* the file (once!). Couldn't the repeated runs of JMH even distort the results here? (That's just gut feeling - I'm not deeply familiar with JMH. One of the main reasons is that even the smallest benchmark can take ages when run in JMH...) – Marco13 May 07 '19 at 11:36
  • @StephenC I have tried other compression algorithms like Snappy and LZ4, but both have terrible compression; most of the time they produced output larger than the original. I know BLOB is optimal for storing this type of data and it saves space, but the requirement is fixed. I even tried gzip's no-compression mode, but there was still no big difference in the time. I agree with your point 1 about streaming. Can you give me an example or a modified version of my code? Thank you for your time – SHUHAIB AREEKKAN May 07 '19 at 11:37
  • 1) If you are not prepared to discuss BLOB versus CLOB, then you are limiting the performance improvement you can get. 2) Read the references I gave you. 3) If you don't get any compression for XML, you have probably coded it incorrectly. 4) You should be able to find sample code using Google. – Stephen C May 07 '19 at 11:41
  • @StephenC I am having issues with the line below: gzipOutputStream.write(str.getBytes(StandardCharsets.UTF_8)); In this code str.getBytes() does not take too much time (I will improve that step in the future), but I need to immediately improve gzipOutputStream.write(); that method takes most of the time compared to any other step, including all the conversions. – SHUHAIB AREEKKAN May 07 '19 at 11:43
  • *"can give me a modified version of my code"* - How much are you offering to pay me. Would $200 per hour be reasonable? :-) (Hint: I think I am finished with this question now .....) – Stephen C May 07 '19 at 11:44
  • @StephenC You're asking for my monthly salary for this job. You should look into how much an Indian junior dev gets paid in his first job. I am not interested in money matters. Thanks for all your effort; it gave me a spark to figure it out myself. – SHUHAIB AREEKKAN May 07 '19 at 11:52
  • @SHUHAIBAREEKKAN - That was a joke. But the serious point is that I am not going to do your work for you. You shouldn't ask *anyone* to do that kind of thing for you on StackOverflow. – Stephen C May 07 '19 at 11:57
  • @StephenC I am sorry. I have very low experience with StackOverflow. – SHUHAIB AREEKKAN May 07 '19 at 12:02