
I have an application which is supposed to copy a large number of files from a source such as S3 into HDFS. The application uses Apache DistCp internally and streams each individual file from the source into HDFS.
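
Internally, the per-file copy is essentially an open-stream-and-write loop against the HDFS FileSystem API, roughly like the simplified sketch below (placeholder names, not the exact application code); the DFSClient acquires a lease on the target path when the output stream is created:

    import java.io.InputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    // Simplified sketch of streaming one source file into HDFS (placeholder names).
    public class StreamToHdfs {
        public static void copy(InputStream source, String hdfsTarget) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // The client takes out a lease on hdfsTarget when the stream is created
            // and holds it until the stream is closed.
            FSDataOutputStream out = fs.create(new Path(hdfsTarget));
            try {
                // Stream the source bytes into HDFS; do not close the streams here.
                IOUtils.copyBytes(source, out, conf, false);
            } finally {
                out.close();
            }
        }
    }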

Each individual file is around 1 GB and has 1K columns of strings. When I choose to copy over all the columns, the write fails with the following error:

2014-05-20 23:57:35,939 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block null bad datanode[0] nodes == null
2014-05-20 23:57:35,939 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/xyz/2014/01/02/control-Jan-2014-14.gz" - Aborting...
2014-05-20 23:57:54,369 ERROR abc.mapred.distcp.DistcpRunnable: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /xyz/2014/01/02/control-Jan-2014-14.gz File does not exist. [Lease.  Holder: DFSClient_attempt_201403272055_15994_m_000004_0_-1476619343_1, pendingcreates: 4]
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1720)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1711)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1619)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:736)
    at sun.reflect.GeneratedMethodAccessor41.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:578)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)

I believe this is because it takes too long to write one large file from the source into HDFS. When I modify the application to copy over only 50, 100, or 200 columns, it runs to completion. The application fails when the number of columns being copied for each row is greater than 200.
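
To clarify what I mean by copying only the first N columns, the per-line truncation looks roughly like this (the tab delimiter and the names below are illustrative, not the exact code):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.Writer;

    // Illustrative sketch of "write only the first maxColumns columns of each row";
    // the tab delimiter and the reader/writer wiring are assumptions.
    public class ColumnLimitedCopy {
        public static void copy(BufferedReader in, Writer out, int maxColumns) throws IOException {
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split("\t", -1);
                int n = Math.min(maxColumns, cols.length);
                StringBuilder row = new StringBuilder();
                for (int i = 0; i < n; i++) {
                    if (i > 0) {
                        row.append('\t');
                    }
                    row.append(cols[i]);
                }
                row.append('\n');
                out.write(row.toString());
            }
        }
    }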

I have no control over the source files.

I cannot find anything about increasing the lease expiration.

Any pointers?

  • Can you give an idea of how long it takes before you get this exception? – Venkat May 21 '14 at 00:38
  • Around one minute after the distcp has opened the stream to the source file. – user1084874 May 21 '14 at 00:40
  • Only a minute? Does not seem to be because of the time taken to copy. I got this sometimes when the file was deleted while copying. Are you sure some other process is not deleting it? – Venkat May 21 '14 at 00:52
  • Yes I am pretty sure that no other process is deleting any of the files. – user1084874 May 21 '14 at 01:14
  • Please note that there is no such error when I choose to write only up to 200 columns for each line in the source file while iterating through the input stream with a BufferedReader. But as soon as I go beyond 200 columns, it starts throwing this error. – user1084874 May 21 '14 at 01:21
  • Can you check whether the destination directory path to which you are copying exists? See the File Creation heading in this [link](http://itm-vm.shidler.hawaii.edu/HDFS/ArchDocDecomposition.html). – donut May 21 '14 at 03:29

1 Answer


So finally I was able to determine what was going on. From the source, S3, our application was downloading files like:

 /xyz/2014/01/week1/abc
 /xyz/2014/01/week1/def
 /xyz/2014/01/week2/abc
 /xyz/2014/01/week2/def
 /xyz/2014/01/week3/abc
 /xyz/2014/01/week3/def 

Notice the same file names across different weeks. Each of these files was then being written to HDFS using the DFSClient. So essentially multiple mappers were trying to write the "same file" (because of identical file names such as abc and def) even though the files were actually different. Since a client has to acquire a lease before writing a file, and the client writing the first "abc" file did not release its lease while its write was in progress, the other client trying to write the other "abc" file got the LeaseExpiredException with the lease mismatch message.
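
One way to avoid the collision is to make sure every source file maps to a distinct destination path, for example by folding the week directory into the destination file name. The helper below is only an illustrative sketch of that idea, not necessarily what the application ended up doing:

    import org.apache.hadoop.fs.Path;

    // Illustrative helper: keep the week directory in the destination name so that
    // /xyz/2014/01/week1/abc and /xyz/2014/01/week2/abc no longer collide on the
    // same HDFS path (and therefore the same lease).
    public class DestPath {
        public static Path forSource(String sourcePath, String hdfsBaseDir) {
            String[] parts = sourcePath.split("/");
            String week = parts[parts.length - 2];  // e.g. "week1"
            String name = parts[parts.length - 1];  // e.g. "abc"
            return new Path(hdfsBaseDir, week + "-" + name);  // e.g. <base>/week1-abc
        }
    }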

But the name collision still does not explain why the client that first acquired the lease for the write did not succeed. I would expect the first writer of each such file to succeed. Any explanation?

  • The LeaseExpiredException would cause the job to fail, and hence none (or maybe only some) of the files would be copied. In this case, adding the exact command you executed to do the distcp, along with a simple representation of the file system, would help. – D3V Jul 19 '15 at 09:56
  • Thanks for this. This happened to me when an Ansible playbook was inadvertently copying the same set of files on multiple nodes to the same destination in HDFS. Changing the operation to run once fixed the problem. – trinth May 02 '17 at 18:17