I have a small program that writes 10 records per second to a block-compressed SequenceFile on HDFS and then calls hsync() every 5 minutes to ensure that everything older than 5 minutes is available for processing.
As my code is quite a few lines, I have only extracted the important bits:
// initialize
Configuration hdfsConfig = new Configuration();
CompressionCodecFactory codecFactory = new CompressionCodecFactory(hdfsConfig);
CompressionCodec compressionCodec = codecFactory.getCodecByName("default");
SequenceFile.Writer writer = SequenceFile.createWriter(
        hdfsConfig,
        SequenceFile.Writer.file(path),
        SequenceFile.Writer.keyClass(LongWritable.class),
        SequenceFile.Writer.valueClass(Text.class),
        SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, compressionCodec)
);
// ...
// append
LongWritable key = new LongWritable(new Date().getTime());
Text val = new Text("Some value");
writer.append(key, val);
// ...
// then every 5 minutes...
logger.info("about to sync...");
writer.hsync();
logger.info("synced!");
From the logs alone, the sync operation appears to work just as expected; however, the file on HDFS remains tiny. After a while some headers and a few events do show up, but nowhere near as often as I call hsync(). Once the file is closed, everything is flushed at once.
After each expected sync I have also tried to manually check the content of the file with hdfs dfs -text filename to see if the data is there; however, the file appears empty here as well.
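For completeness, this is roughly how I would check it from code as well (just a sketch, not part of the program above; it assumes SequenceFile.Reader can open the file while it is still being written):

Configuration conf = new Configuration();
// count how many records are actually readable before the writer closes the file
try (SequenceFile.Reader reader = new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
    LongWritable key = new LongWritable();
    Text value = new Text();
    long visible = 0;
    while (reader.next(key, value)) {
        visible++;
    }
    System.out.println("records visible so far: " + visible);
}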
Are there any known reasons why writer.hsync() does not work, and if so, are there any workarounds for this?
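One workaround I am considering, but have not verified yet, is to force the writer's own buffer out before syncing the underlying stream. My assumption is that with block compression the records sit in the writer's in-memory block until it is full, so hsync() may simply have nothing to push to the datanodes. SequenceFile.Writer.sync() (the call that writes a sync marker) looks like it should also write out the pending block for block-compressed files, so the 5-minute flush would become:

// every 5 minutes...
logger.info("about to sync...");
writer.sync();   // assumption: forces the buffered compression block (plus a sync marker) into the stream
writer.hsync();  // then sync the stream itself to the datanodes
logger.info("synced!");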
Further test case for this issue:
import java.io.IOException;
import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class WriteTest {
    private static final Logger LOG = LoggerFactory.getLogger(WriteTest.class);

    public static void main(String[] args) throws Exception {
        SequenceFile.CompressionType compressionType = SequenceFile.CompressionType.RECORD;
        CompressionCodec compressionCodec;
        String compressionCodecStr = "default";
        CompressionCodecFactory codecFactory;
        Configuration hdfsConfig = new Configuration();

        codecFactory = new CompressionCodecFactory(hdfsConfig);
        compressionCodec = codecFactory.getCodecByName(compressionCodecStr);

        String hdfsURL = "hdfs://10.0.0.1/writetest/";
        Date date = new Date();
        Path path = new Path(
                hdfsURL,
                "testfile" + date.getTime()
        );

        SequenceFile.Writer writer = SequenceFile.createWriter(
                hdfsConfig,
                SequenceFile.Writer.keyClass(LongWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(compressionType, compressionCodec),
                SequenceFile.Writer.file(path)
        );

        for (int i = 0; i < 10000000; i++) {
            Text value = new Text("New value!");
            LongWritable key = new LongWritable(date.getTime());

            writer.append(key, value);
            writer.hsync();

            Thread.sleep(1000);
        }

        writer.close();
    }
}
The result is that there is one fsync at the beginning, writing the SequenceFile headers, and then no more fsyncs. Content is written to disk once the file is closed.
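One variation I intend to try, based on a guess that hsync() does move data to the datanodes but does not update the file length the namenode reports (which hdfs dfs -text presumably goes by), is to create the output stream myself so that I can also ask for the length to be updated. This is only a sketch, and the HdfsDataOutputStream cast is an assumption on my part:

// extra imports this would need: java.util.EnumSet, org.apache.hadoop.fs.FileSystem,
// org.apache.hadoop.fs.FSDataOutputStream, org.apache.hadoop.hdfs.client.HdfsDataOutputStream
FileSystem fs = path.getFileSystem(hdfsConfig);
FSDataOutputStream out = fs.create(path);
SequenceFile.Writer writer = SequenceFile.createWriter(
        hdfsConfig,
        SequenceFile.Writer.stream(out),                 // hand the writer my own stream
        SequenceFile.Writer.keyClass(LongWritable.class),
        SequenceFile.Writer.valueClass(Text.class),
        SequenceFile.Writer.compression(compressionType, compressionCodec)
);

// ... append as before, then instead of a plain writer.hsync():
writer.hsync();
if (out instanceof HdfsDataOutputStream) {
    // assumption: UPDATE_LENGTH also refreshes the length visible on the namenode
    ((HdfsDataOutputStream) out).hsync(EnumSet.of(HdfsDataOutputStream.SyncFlag.UPDATE_LENGTH));
}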