
I am using the code below to read a large XML file (several GB) in a Hadoop RecordReader using XMLStreamReader:

public class RecordReader {
    private XMLStreamReader reader;
    int progressCount = 0;

    public RecordReader() throws IOException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        FSDataInputStream fdDataInputStream = fs.open(file); // HDFS file
        try {
            reader = factory.createXMLStreamReader(fdDataInputStream);
        } catch (XMLStreamException exception) {
            throw new RuntimeException("XMLStreamException exception : ", exception);
        }
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return progressCount;
    }
}

My question is how to get the reading progress of the file with XMLStreamReader, since it does not expose a start or end position from which to calculate a progress percentage. I have referred to "How do I keep track of parsing progress of large files in StAX?", but I cannot use filterReader. Please help me here.

Bhushan Kawadkar

1 Answer


You could wrap the InputStream in a FilterInputStream subclass that reports how many bytes have been read so far.

public interface InputStreamListener {
    void onBytesRead(long totalBytes);
}

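import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class PublishingInputStream extends FilterInputStream {
    private final InputStreamListener listener;
    private long totalBytes = 0;

    public PublishingInputStream(InputStream in, InputStreamListener listener) {
       super(in);
       this.listener = listener;
    }

    @Override
    public int read() throws IOException {
       int value = super.read();
       if (value != -1) { // -1 signals end of stream, not a byte
          this.totalBytes++;
          this.listener.onBytesRead(totalBytes);
       }
       return value;
    }

    // FilterInputStream.read(byte[]) delegates to this method, so overriding
    // it alone covers both array variants without counting bytes twice.
    @Override
    public int read(byte[] b, int off, int len) throws IOException {
       int count = super.read(b, off, len);
       if (count > 0) { // count is -1 at end of stream
          this.totalBytes += count;
          this.listener.onBytesRead(totalBytes);
       }
       return count;
    }
}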

Usage

XMLInputFactory factory = XMLInputFactory.newInstance();
InputStream in = fs.open(file);
final long fileSize = someHadoopService.getFileLength(file);
InputStreamListener listener = new InputStreamListener() {
    public void onBytesRead(long totalBytes) {
        System.out.println(String.format("Read %s of %s bytes", totalBytes, fileSize));
    }
};
InputStream publishingIn = new PublishingInputStream(in, listener);
try {
    reader = factory.createXMLStreamReader(publishingIn);
    // etc
lance-java
  • Actually, I am using org.apache.hadoop.mapreduce.RecordReader and need to report the progress from inside it. Could you please help me with that? – Bhushan Kawadkar Jun 10 '16 at 15:07
  • So, update the progress in the custom `InputStreamListener`. To get a percentage you'll need to know the total bytes. `InputStream.available()` does NOT guarantee to return the total bytes (it returns an estimate of the number of bytes that can be read without blocking). But you may find that this method works (depending on the InputStream implementation). – lance-java Jun 10 '16 at 15:43
  • I have tried the `.available()` method, but the total bytes read and the available bytes are always the same. – Bhushan Kawadkar Jun 13 '16 at 10:07
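
Following up on the comments above, here is a minimal sketch of how the progress fraction could be computed without relying on `available()`. It assumes the total length is known up front. In a Hadoop job that would probably come from `FileSplit.getLength()` or `FileSystem.getFileStatus(path).getLen()`, but here a byte array stands in for the HDFS file so the example runs on its own. The names `XmlProgressSketch`, `CountingInputStream`, `progress` and `parseAndGetProgress` are made up for illustration and are not part of Hadoop or StAX.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class XmlProgressSketch {

    /** Hypothetical helper: counts the bytes the parser pulls through it. */
    static final class CountingInputStream extends FilterInputStream {
        private long bytesRead = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int value = super.read();
            if (value != -1) {
                bytesRead++;
            }
            return value;
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            int count = super.read(b, off, len);
            if (count > 0) {
                bytesRead += count;
            }
            return count;
        }

        long getBytesRead() {
            return bytesRead;
        }
    }

    /** Fraction in [0, 1]; totalLength must come from file metadata. */
    static float progress(long bytesRead, long totalLength) {
        if (totalLength <= 0) {
            return 0.0f;
        }
        return Math.min(1.0f, (float) bytesRead / totalLength);
    }

    /** Parses the whole document and returns the progress at the end. */
    static float parseAndGetProgress(byte[] xml) throws XMLStreamException {
        long totalLength = xml.length; // stand-in for the file or split length
        CountingInputStream counting =
                new CountingInputStream(new ByteArrayInputStream(xml));
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(counting);
        while (reader.hasNext()) {
            reader.next();
            // A getProgress() implementation could return this value at any time.
            progress(counting.getBytesRead(), totalLength);
        }
        reader.close();
        return progress(counting.getBytesRead(), totalLength);
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<records><r>a</r><r>b</r></records>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println("final progress: " + parseAndGetProgress(xml));
    }
}
```

Two caveats: the parser reads ahead in buffered chunks, so the byte count runs slightly ahead of the element actually being processed, and if the record reader handles only one split of the file, the count should be measured against the split length rather than the whole file.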