I have two processes running. One is writing files to an HDFS and the other is loading those files.
The first process (the one that writes the file) uses:
private void writeFileToHdfs(byte[] sourceStream, Path outFilePath) {
    FSDataOutputStream out = null;
    try {
        // create the file
        out = getFileSystem().create(outFilePath);
        out.write(sourceStream);
    } catch (Exception e) {
        LOG.error("Error while trying to write a file to hdfs", e);
    } finally {
        try {
            if (null != out)
                out.close();
        } catch (IOException e) {
            LOG.error("Could not close output stream to hdfs", e);
        }
    }
}
The second process reads those files for further processing. A file is first created and then populated with content; this takes time (only a few milliseconds, but still), and during that window the second process may pick up the file before it has been fully written and closed.
Note that HDFS does not keep locking information in the NameNode, so there is no daemon that can check whether a file is locked before accessing it.
What is the best way to resolve this issue? Here are my thoughts so far:
- Copying each file to a new folder once it is fully written and closed; the second process would then read only from that folder.
- Renaming each file according to some naming convention once it is fully written and closed; the second process would then pick up only files matching that convention.
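For context, the second option is essentially the write-then-rename pattern that Hadoop's own output committers rely on: write under a temporary name, then make the file visible in a single rename (on HDFS, FileSystem.rename is a metadata operation at the NameNode). A minimal sketch of the pattern, using java.nio on a local filesystem as a stand-in for the HDFS API; the class name, the ".tmp" suffix, and the demo paths are my own assumptions, not anything from the original code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicPublish {

    // Writer side: write the bytes under a temporary name first,
    // then rename so the final name only ever refers to a complete file.
    public static void publish(byte[] data, Path target) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        Files.write(tmp, data); // this is the slow part the reader must not see
        // single atomic step that makes the file visible under its final name
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    // Reader side: only pick up files that do not carry the in-progress suffix.
    public static boolean isComplete(Path p) {
        return !p.getFileName().toString().endsWith(".tmp");
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("publish-demo");
        Path target = dir.resolve("data.bin");
        publish(new byte[] {1, 2, 3}, target);
        System.out.println(Files.exists(target) && isComplete(target)); // prints "true"
    }
}
```

On HDFS the equivalent would be FileSystem.create on the temporary path followed by FileSystem.rename to the final path after close(); the reader simply never looks at names carrying the temporary suffix.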
I have a feeling I'm trying to solve a well-known problem and missing something. Is there a best practice for such a problem?