I have a simple program that creates a Hadoop SequenceFile. Each time the code is run, it leaves two files in the working directory:

   mySequenceFile.txt
   .mySequenceFile.txt.crc

After each run the sizes of both files remain the same, but the contents of the .crc file are different!

Is this a bug or expected behaviour?

MiamiBeach

1 Answer

This is confusing, but it is expected behaviour.
According to the SequenceFile format, each sequence file has a sync marker that is 16 bytes long. In block-compressed sequence files the sync marker is written at every block; in uncompressed or record-compressed sequence files it is written after a certain number of records, or after a single very long record.
The catch is that the sync marker is effectively a random value. It is written in the file header, which is how the reader recognizes it. It stays the same within one sequence file, but it can differ (and in practice does) from one sequence file to another.
So the files are logically the same but differ at the byte level. A CRC is a checksum over the raw bytes, so it differs between the two files as well.
I haven't found a way to set the sync marker manually. If someone finds one, please post it here.
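
The effect can be reproduced without Hadoop. The following is only a sketch that uses the plain JDK: the class `SyncMarkerDemo`, its methods, and the toy file layout (magic bytes, marker, payload, marker) are made up for illustration and are not the real SequenceFile layout. Deriving the marker from an MD5 digest of a unique ID plus a timestamp is, as far as I understand it, similar in spirit to what Hadoop's writer does, but treat that as an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.UUID;
import java.util.zip.CRC32;

class SyncMarkerDemo {

    // Produce a fresh 16-byte marker: the MD5 digest of a unique ID plus the current time.
    static byte[] newSyncMarker() {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            String seed = UUID.randomUUID() + "@" + System.currentTimeMillis();
            return md5.digest(seed.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // Toy layout (hypothetical): magic bytes, marker in the "header", payload, marker again.
    static byte[] buildFile(byte[] sync, byte[] payload) {
        byte[] magic = {'S', 'E', 'Q', 6};
        byte[] out = new byte[magic.length + sync.length + payload.length + sync.length];
        int pos = 0;
        System.arraycopy(magic, 0, out, pos, magic.length);
        pos += magic.length;
        System.arraycopy(sync, 0, out, pos, sync.length);
        pos += sync.length;
        System.arraycopy(payload, 0, out, pos, payload.length);
        pos += payload.length;
        System.arraycopy(sync, 0, out, pos, sync.length);
        return out;
    }

    // Checksum over the raw bytes, like the one stored in a .crc side file.
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] payload = "key1\tvalue1".getBytes(StandardCharsets.UTF_8);

        // Two "runs" with identical records but independently generated markers.
        byte[] first = buildFile(newSyncMarker(), payload);
        byte[] second = buildFile(newSyncMarker(), payload);

        System.out.println("same length: " + (first.length == second.length));
        System.out.println("same bytes:  " + Arrays.equals(first, second));
        System.out.println("same CRC32:  " + (crc32(first) == crc32(second)));
    }
}
```

The two byte arrays always have the same length because the marker has a fixed size, yet their contents differ because each run generates its own marker, and a checksum over the raw bytes changes with them. That matches the observation in the question: same file sizes, different .crc contents.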

Evgenii Glotov