4

I have two processes running. One is writing files to an HDFS and the other is loading those files.

The first process (The one that writes the file) is using:

private void writeFileToHdfs(byte[] sourceStream, Path outFilePath) {
    FSDataOutputStream out = null;
    try {
        // create the file
        out = getFileSystem().create(outFilePath);
        out.write(sourceStream);
    } catch (Exception e) {
        LOG.error("Error while trying to write a file to hdfs", e);
    } finally {
        try {
            if (null != out)
                out.close();
        } catch (IOException e) {
            LOG.error("Could not close output stream to hdfs", e);
        }
    }
}

The second process reads those files for further processing. A file is first created and only then populated with content. This takes time (a few milliseconds, but still), and during that window the second process may pick up the file before it is fully written and closed.

Notice that HDFS does not keep locking info in the namenode - so there is no daemon out there that can check if the file is locked before accessing it.

I wonder what is the best way to resolve this issue.

Here are my thoughts:

  1. Copying the files to a new folder once they are fully written and closed, so the second process reads only from this new folder.
  2. Renaming each file according to some naming convention once it is fully written and closed, so the second process picks up only files that match this convention (a rough sketch of this is just below).
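For illustration, option 2 could look roughly like this (just a sketch, reusing the getFileSystem() and LOG members from the code above; the ".tmp" suffix and the method name are only an example of a convention):

private void writeFileThenRename(byte[] content, Path finalPath) throws IOException {
    // write under a temporary name first; the reader ignores anything ending in ".tmp"
    Path inProgress = new Path(finalPath.toString() + ".tmp");
    FSDataOutputStream out = getFileSystem().create(inProgress);
    try {
        out.write(content);
    } finally {
        out.close();
    }
    // rename is a cheap metadata operation in HDFS, so the reader only ever
    // sees the final name once the file is complete and closed
    if (!getFileSystem().rename(inProgress, finalPath)) {
        LOG.error("Could not rename " + inProgress + " to " + finalPath);
    }
}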

I have a feeling I'm trying to solve a well-known problem and that I'm missing something. Is there a best practice for such a problem?

forhas
  • side note: if you're using Java 7 you don't need to do all that finally stuff, just do a try-with-resources – Ross Drew Dec 10 '13 at 09:20
  • Why don't you use sockets to communicate between your processes? P1 can communicate with P2 and at the same time dump the file, so that if they are not online simultaneously P2 can still pick it up later... – UmNyobe Dec 10 '13 at 09:22
  • @RossDrew I am using Java 7 but I'm quite new to it. I will check it out, thanks. – forhas Dec 10 '13 at 09:30
  • @UmNyobe I'm not sure I follow. I'd rather keep the processes decoupled. – forhas Dec 10 '13 at 09:31
  • https://stackoverflow.com/questions/45605977/is-there-any-way-to-check-whether-the-hadoop-file-is-already-opened-for-write – vakarami Mar 04 '21 at 09:55

3 Answers

3

Apache Commons has a utility for that: just touch the file, and an exception will tell you if it's already locked.

import org.apache.commons.io.FileUtils;

import java.io.File;
import java.io.IOException;

boolean fileAvail = false;

try {
    FileUtils.touch(new File(fileName)); // throws IOException if the file is being used
    fileAvail = true;
} catch (IOException e) {
    fileAvail = false;
}

(also) Try with Resources

In Java 7 you can use this functionality on anything that implements AutoCloseable, such as files, sockets and database connections; the resource is closed automatically as soon as the scope of the try block ends, by doing this:

try (FSDataOutputStream out = getFileSystem().create(outFilePath)) {
    // use out in here
}
// No finally required - catch is optional

...saves all that extra code

Ross Drew
  • does it trigger the exception when the file is open with read permission? – UmNyobe Dec 10 '13 at 09:33
  • I don't think so. I think it basically checks whether a file's last modified date is editable and throws if not – Ross Drew Dec 10 '13 at 09:38
  • Look Before You Leap: Do not use exceptions for flow control. – Rafael Winterhalter Dec 10 '13 at 09:39
  • True, but I don't see another way of doing what the OP asks (aside from your pattern-based approach, that is). – Ross Drew Dec 10 '13 at 09:42
  • Regarding the try-with-resources - read about it, learned something new and changed my code (thanks!). Regarding your solution, I'm not sure yet whether the consumer can do that; let me check and elaborate more later. – forhas Dec 10 '13 at 11:15
  • @RossDrew This is a good answer for a standard file system, but I'm using Hadoop's HDFS, and HDFS does not keep locking info in the namenode - so there is no daemon out there that can check if the file is locked before accessing it (updated the question). My guess is that your code would never throw an exception - regardless of whether the file is being used or not. – forhas Dec 10 '13 at 14:29
  • Then I think perhaps @raphw's solution is your best bet. – Ross Drew Dec 10 '13 at 14:31
1

Are you talking about two separate processes here or about two separate threads within the same (JVM) process?

Both ways, this is a consumer-producer problem, and what you are missing is some proper synchronization between the producer and the consumer. If you are running two threads within the same JVM process, you could use a BlockingQueue to transfer some sort of file-transfer-finished token from the producer to the consumer, for example the file's name once the file is fully written and its stream closed. Once a file name is found in the queue, the consumer can be certain that the file is fully written and closed, because the producer confirmed it.
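
A minimal sketch of such a hand-off (the class and method names are only placeholders):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class FinishedFileQueue {

    // holds the names of files that are fully written and closed
    private final BlockingQueue<String> finishedFiles = new LinkedBlockingQueue<>();

    // called by the producer thread right after the output stream was closed
    void announce(String fileName) throws InterruptedException {
        finishedFiles.put(fileName);
    }

    // called by the consumer thread; blocks until a complete file is available
    String awaitNext() throws InterruptedException {
        return finishedFiles.take();
    }
}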

However, if you are using two different processes, the problem is a little harder to solve, depending on the other component's language and the networking setup. You would have to implement some sort of queue that both processes can use, for example by sending information over a local network port, so that the processes know about each other's work.
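
For example, a very rough sketch with plain sockets (the port number, host name and one-file-name-per-line protocol are arbitrary choices; a real setup would also need error handling and reconnects, or a proper message broker):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

// Consumer side: listens for "file finished" notifications from the producer process.
class FileNotificationServer {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(9099)) { // arbitrary port
            while (true) {
                try (Socket client = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(client.getInputStream()))) {
                    String fileName = in.readLine(); // one file name per connection
                    // the producer has closed the HDFS file, so it is safe to read it now
                    System.out.println("Ready to process: " + fileName);
                }
            }
        }
    }
}

// Producer side: called right after the HDFS stream has been closed.
class FileNotificationClient {
    static void notifyConsumer(String fileName) throws Exception {
        try (Socket socket = new Socket("consumer-host", 9099); // host name is a placeholder
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
            out.println(fileName);
        }
    }
}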

No matter what, I would always avoid moving files around on the file system, since this is a rather expensive operation compared to sending simple tokens. Also, moving files around might expose files that were not yet completely moved, depending on the language you are using.

Rafael Winterhalter
  • Indeed, if we were talking about 2 threads it would be a simple producer-consumer problem, which I'm familiar with. But I'm talking about 2 completely different processes, running on different machines. – forhas Dec 10 '13 at 09:33
  • Same problem here then: send a message over an open port where the producing process confirms to the consuming process that file *X* should be handled. I would avoid inferring such state from the file system. This way, you could also add new consumers and producers at some later stage and add some load balancing. – Rafael Winterhalter Dec 10 '13 at 09:35
  • You're eliminating the question! Suppose the other application is a third-party app that we don't have access to change; then what? – vakarami Mar 04 '21 at 07:04
0

Do you really need two processes here? Why don't you create two threads and then join them?
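
For example (a bare-bones sketch; writeFiles() and readFiles() are placeholders for what the two processes currently do):

void runInOneProcess() throws InterruptedException {
    final Thread writer = new Thread(new Runnable() {
        @Override
        public void run() {
            writeFiles(); // placeholder: everything the first process does
        }
    });
    writer.start();
    writer.join();   // block until every file has been written and closed
    readFiles();     // placeholder: everything the second process does
}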

Abdul Salam