
I am trying to read data from an Avro file stored in HDFS. So far I can read the entire file using DataFileReader or DataFileStream. Now I want to implement pagination. Is there a specific way to do it?

I have gone through the basic documentation, and as I understand it this can be done using synchronization markers. This is what I tried:

SeekableInput seekableInput = new AvroFSInput(dataInputStream, 5);
DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>();
DataFileReader<GenericRecord> fileReader = new DataFileReader<GenericRecord>(seekableInput, datumReader);
fileReader.seek(startOffset);  // set to the start-offset
while (fileReader.hasNext() && !fileReader.pastSync(endOffset)) {
    GenericRecord gr = fileReader.next();
    System.out.println(gr);
}

But this code throws:

Exception in thread "main" org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
    at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210)
    at com.globalids.test.AvroTest.deserializeWithPageing(AvroTest.java:112)
    at com.globalids.test.AvroTest.main(AvroTest.java:45)
Caused by: java.io.IOException: Invalid sync!
    at org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:293)
    at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:198)
    ... 2 more 

I have also tried setting the sync interval while writing the data, and calling sync() after each record is appended to the file with DataFileWriter. Can anyone point out what I'm doing wrong?

Thank you in advance.

Pradatta

1 Answer


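You need to call sync() instead of seek() if startOffset is not a valid block position in the file. seek() expects a known synchronization point (such as one returned by DataFileWriter.sync() while writing), whereas sync() advances to the next synchronization point after the given position: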

SeekableInput seekableInput = new AvroFSInput(dataInputStream, 5);    
DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>();
DataFileReader<GenericRecord> fileReader = new DataFileReader<GenericRecord>(seekableInput, datumReader);

fileReader.sync(startOffset);  // move to the next sync point after startOffset

while (fileReader.hasNext() && !fileReader.pastSync(endOffset)) {
    GenericRecord gr = fileReader.next();
    System.out.println(gr);
}
bestnaja
  • This code did work; at least it no longer throws an exception. But the records I get still start from the very beginning of the file. What I want is this: if my startOffset is 5, fileReader should start reading from record number 5, so that I can jump to any record for pagination. Can you suggest how to do that? And thank you very much for your previous answer. – Pradatta Dec 20 '13 at 10:28
  • Avro does not support seeking to a record, but does support seeking to a block. And each block tells you how many records are in it, so you can work out how to get to a specific index that way. – Codeman Aug 20 '20 at 00:42
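
To make the block-counting idea in the last comment concrete, here is a minimal sketch of the bookkeeping only, with no Avro dependency. The class AvroPageIndex, its Location type, and the methods addBlock and locate are hypothetical names invented for illustration and are not part of Avro. It assumes you have already scanned the file once and recorded, for each block, a position you can seek to and the number of records in that block.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not an Avro class): maps a record number to the block
// that contains it and the number of records to skip inside that block.
final class AvroPageIndex {

    // Where to seek, and how many records to discard after seeking.
    static final class Location {
        final long blockStart;
        final long skip;

        Location(long blockStart, long skip) {
            this.blockStart = blockStart;
            this.skip = skip;
        }
    }

    private final List<Long> blockStarts = new ArrayList<>();
    private final List<Long> firstRecord = new ArrayList<>();
    private long totalRecords = 0;

    // Register the blocks in file order: seekable position and record count.
    void addBlock(long blockStart, long recordCount) {
        if (recordCount <= 0) {
            throw new IllegalArgumentException("recordCount must be positive");
        }
        blockStarts.add(blockStart);
        firstRecord.add(totalRecords);
        totalRecords += recordCount;
    }

    long size() {
        return totalRecords;
    }

    // Find the last block whose first record number is <= recordIndex (0-based).
    Location locate(long recordIndex) {
        if (recordIndex < 0 || recordIndex >= totalRecords) {
            throw new IndexOutOfBoundsException("record " + recordIndex);
        }
        int lo = 0;
        int hi = firstRecord.size() - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (firstRecord.get(mid) <= recordIndex) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return new Location(blockStarts.get(lo), recordIndex - firstRecord.get(lo));
    }
}
```

With Avro itself, the two inputs would presumably come from fileReader.previousSync() and fileReader.getBlockCount(), read after hasNext() has loaded each block. To serve a page you would then call fileReader.seek(location.blockStart) and discard location.skip records with next() before returning the page. Treat that wiring as an assumption to verify against your Avro version rather than a confirmed recipe.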