Count the bytes written to file via BufferedWriter formed by GZIPOutputStream

Question

I have a BufferedWriter as shown below:

BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(
        new GZIPOutputStream( hdfs.create(filepath, true ))));

String line = "text";
writer.write(line);

I want to find out the number of bytes written to the file without querying the file like this:

hdfs = FileSystem.get( new URI( "hdfs://localhost:8020" ), configuration );

filepath = new Path("path");
hdfs.getFileStatus(filepath).getLen();

as that adds overhead, which I want to avoid.

Also, I can't do this:

line.getBytes().length;

as it gives the size before compression.

Sounds like you want some kind of Java [`tee`](http://www.frischcode.com/2013/11/need-to-write-same-content-to-multiple.html). — Elliott Frisch, Aug 29 '14 at 15:10

score 2 · Accepted Answer · answered Aug 29 '14 at 15:21

You can use the CountingOutputStream from the Apache Commons IO library.

Place it between the GZIPOutputStream and the file OutputStream (hdfs.create(..)).

After writing the content to the file you can read the number of written bytes from the CountingOutputStream instance.
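To illustrate the placement described above, here is a minimal sketch that uses only the JDK. `ByteCounter` is a hypothetical stand-in for Commons IO's `CountingOutputStream` (where `getByteCount()` would play the role of `getCount()`), and a `ByteArrayOutputStream` stands in for `hdfs.create(filepath, true)`; both are assumptions made so the example runs without Hadoop or Commons IO.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class CountingPlacementDemo {

    /** Hypothetical stand-in for Commons IO's CountingOutputStream. */
    static class ByteCounter extends FilterOutputStream {
        private long count;

        ByteCounter(OutputStream out) {
            super(out);
        }

        @Override
        public void write(int b) throws IOException {
            out.write(b);
            count++;
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            out.write(b, off, len); // delegate to the wrapped stream directly
            count += len;
        }

        // write(byte[]) is inherited and forwards to write(b, 0, b.length),
        // so it is counted exactly once without an override.

        long getCount() {
            return count;
        }
    }

    /** Returns {bytes counted, bytes that actually reached the sink}. */
    static long[] writeAndCount(String line) throws IOException {
        // Assumption: the in-memory sink replaces hdfs.create(filepath, true).
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        ByteCounter counter = new ByteCounter(sink);

        // The counter sits below the GZIPOutputStream, so it sees compressed bytes.
        try (BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(counter), StandardCharsets.UTF_8))) {
            writer.write(line);
        }
        // Read the count only after close(): the writer and the GZIP stream buffer data.
        return new long[] { counter.getCount(), sink.size() };
    }

    public static void main(String[] args) throws IOException {
        long[] result = writeAndCount("text");
        System.out.println("counted=" + result[0] + ", in sink=" + result[1]);
    }
}
```

The count is only complete once the writer has been flushed or closed, because both the BufferedWriter and the GZIPOutputStream hold data back until then.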

score 2 · Answer 2 · edited Oct 26 '19 at 00:14

If this isn't too late, and you are using Java 1.7+ and don't want to pull in an entire library like Guava or Commons IO, you can extend GZIPOutputStream and obtain the data from the associated Deflater like so:

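import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

public class MyGZIPOutputStream extends GZIPOutputStream {

  public MyGZIPOutputStream(OutputStream out) throws IOException {
      super(out);
  }

  // Uncompressed bytes fed into the Deflater so far.
  public long getBytesRead() {
      return def.getBytesRead();
  }

  // Compressed bytes produced by the Deflater so far; this does not include
  // the GZIP header and trailer that GZIPOutputStream writes around the data.
  public long getBytesWritten() {
      return def.getBytesWritten();
  }

  // Changes the compression level of the underlying Deflater.
  public void setLevel(int level) {
      def.setLevel(level);
  }
}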
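As a rough check of what the Deflater counters report, the sketch below compares them with the number of bytes that actually reach the underlying stream. `CountingGzip` is a trimmed-down copy of the class above, and the in-memory sink is an assumption that replaces the HDFS stream. The counters are read after `finish()` but before `close()`, since closing the stream ends its default Deflater. The difference of 18 bytes corresponds to the fixed 10-byte GZIP header and 8-byte trailer that `GZIPOutputStream` writes outside the Deflater.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class DeflaterCountDemo {

    /** Trimmed-down copy of the subclass above, for illustration only. */
    static class CountingGzip extends GZIPOutputStream {
        CountingGzip(OutputStream out) throws IOException {
            super(out);
        }

        long getBytesRead() {
            return def.getBytesRead();
        }

        long getBytesWritten() {
            return def.getBytesWritten();
        }
    }

    /** Returns {uncompressed bytes in, deflated bytes out, total bytes in sink}. */
    static long[] measure(String text) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // assumption: replaces the file stream
        CountingGzip gzip = new CountingGzip(sink);

        gzip.write(text.getBytes(StandardCharsets.UTF_8));
        gzip.finish(); // writes remaining compressed data and the trailer; Deflater is still open

        long in = gzip.getBytesRead();
        long deflated = gzip.getBytesWritten();

        gzip.close(); // ends the Deflater; the counters must not be queried after this
        return new long[] { in, deflated, sink.size() };
    }

    public static void main(String[] args) throws IOException {
        long[] r = measure("text");
        System.out.println("in=" + r[0] + ", deflated=" + r[1] + ", total=" + r[2]);
    }
}
```

If the exact file size is needed, the header and trailer therefore have to be added to `getBytesWritten()`, or the bytes have to be counted below the GZIPOutputStream instead.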

score 0 · Answer 3 · answered Aug 29 '14 at 15:07

You can make your own subclass of OutputStream and count how many times the write method is invoked.

talex

score 0 · Answer 4 · answered Oct 26 '19 at 01:17

This is similar to Olaseni's answer, but I moved the counting into a BufferedOutputStream rather than the GZIPOutputStream. This is more robust, since def.getBytesRead() in Olaseni's answer is not available after the stream has been closed.

With the implementation below, you can supply your own AtomicLong to the constructor so that you can assign the CountingBufferedOutputStream in a try-with-resources block, but still retrieve the count after the block has exited (i.e. after the file is closed).

public static class CountingBufferedOutputStream extends BufferedOutputStream {
    private final AtomicLong bytesWritten;

    public CountingBufferedOutputStream(OutputStream out) throws IOException {
        super(out);
        this.bytesWritten = new AtomicLong();
    }

    public CountingBufferedOutputStream(OutputStream out, int bufSize) throws IOException {
        super(out, bufSize);
        this.bytesWritten = new AtomicLong();
    }

    public CountingBufferedOutputStream(OutputStream out, int bufSize, AtomicLong bytesWritten)
            throws IOException {
        super(out, bufSize);
        this.bytesWritten = bytesWritten;
    }

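    // write(byte[] b) is deliberately not overridden: the inherited
    // FilterOutputStream.write(byte[]) delegates to write(b, 0, b.length),
    // so counting in an override here as well would count those bytes twice.
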
    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        super.write(b, off, len);
        bytesWritten.addAndGet(len);
    }

    @Override
    public synchronized void write(int b) throws IOException {
        super.write(b);
        bytesWritten.incrementAndGet();
    }

    public long getBytesWritten() {
        return bytesWritten.get();
    }
}
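A usage sketch of the try-with-resources pattern described above, under some assumptions: `CountingBuffered` is a compact, hypothetical variant of the class that counts only in `write(byte[], int, int)` and `write(int)`, the 8192-byte buffer size is arbitrary, and a `ByteArrayOutputStream` replaces the real file stream so the result can be checked. The `AtomicLong` is created outside the block, so the count is still readable after everything has been closed.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;
import java.util.zip.GZIPOutputStream;

public class CountAfterCloseDemo {

    /** Compact illustrative variant of the counting stream. */
    static class CountingBuffered extends BufferedOutputStream {
        private final AtomicLong bytesWritten;

        CountingBuffered(OutputStream out, int bufSize, AtomicLong bytesWritten) {
            super(out, bufSize);
            this.bytesWritten = bytesWritten;
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            super.write(b, off, len);
            bytesWritten.addAndGet(len);
        }

        @Override
        public void write(int b) throws IOException {
            super.write(b);
            bytesWritten.incrementAndGet();
        }
    }

    /** Returns {bytes counted, bytes that actually reached the sink}. */
    static long[] writeAndCount(String line) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // assumption: replaces the file stream
        AtomicLong counter = new AtomicLong();                    // outlives the try block

        // GZIP sits above the counting stream, so compressed bytes are counted.
        try (Writer writer = new OutputStreamWriter(
                new GZIPOutputStream(new CountingBuffered(sink, 8192, counter)),
                StandardCharsets.UTF_8)) {
            writer.write(line);
        }

        // All streams are closed here, yet the count is still available.
        return new long[] { counter.get(), sink.size() };
    }

    public static void main(String[] args) throws IOException {
        long[] r = writeAndCount("text");
        System.out.println("counted=" + r[0] + ", in sink=" + r[1]);
    }
}
```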

I wonder if we can hit an issue with duplicated counting, let's say some inner implementation would call write like that: write(byte[] b, int off, int len) followed by write(int b) for every byte in the array. We run into the issue with those bytes counted twice. — kolboc, Jan 28 '21 at 13:32
@kolboc no, the write(byte[] b, int off, int len) call in that case would not be in the loop with the write(int b) call, so there would be no double counting. — Luke Hutchison, Jan 29 '21 at 16:07
For this specific implementation it might be safe, but in general it's a possibility. That might be a good reason to write such counting streams as wrappers and, instead of super calls, use the instance of the wrapped stream, e.g. class CountingOutputStream(val outputStream: OutputStream): OutputStream { override write(int b) { byteCount++; outputStream.write(b); } } That way you're safe regardless of what goes on under the hood. — kolboc, Jan 29 '21 at 17:29
Sorry, there is simply no way for this to count bytes wrongly. The implementation I gave counts exactly the number of bytes written. There is no way to call this and have a different number of bytes written than is counted. The implementation you showed is pretty much exactly my `write(int)` method. — Luke Hutchison, Jan 30 '21 at 18:39
