
I have an FTP server (F), a standalone Linux box (S) and a Hadoop cluster (C). The current file flow is F->S->C. I am trying to improve performance by skipping S.

The current flow is:

wget ftp://user:password@ftpserver/absolute_path_to_file
hadoop fs -copyFromLocal path_to_file path_in_hdfs

I tried:

hadoop fs -cp ftp://user:password@ftpserver/absolute_path_to_file path_in_hdfs

and:

hadoop distcp ftp://user:password@ftpserver/absolute_path_to_file path_in_hdfs

Both hang. The distcp one, being a MapReduce job, is killed by a timeout. The logs (hadoop job -logs) only say it was killed by the timeout. I tried to wget from the FTP server on one of the nodes of C and it worked. What could be the reason, and is there any hint on how to figure it out?
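Update: following my comments below, I am also checking whether every node of C can reach the FTP server, since the distcp mappers may run on any of them. A rough sketch of such a check (the node names are placeholders):

    # wget --spider only checks that the file is reachable; it does not download it.
    for node in node1 node2 node3; do
      ssh "$node" "wget --spider ftp://user:password@ftpserver/absolute_path_to_file" \
        && echo "$node OK" || echo "$node FAILED"
    done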

Denis
  • Latest findings: 1. the distcp map phase reaches 100% and it hangs at reduce 0%, but finally it prints that the map was cancelled by a timeout; 2. distcp -log /hdfspath is for some reason empty; 3. I am able to fs -cp and distcp from a public repository (Mozilla's) to the same cluster. I am investigating: 1. whether all nodes of the cluster have access to the FTP server I am trying to copy from; 2. known issues of the FTP server. – Denis Sep 24 '14 at 10:49
  • Even more info: fs -put and distcp create a path_in_hdfs/filename._COPYING_ file of the correct size. – Denis Sep 24 '14 at 11:05
  • More info: the server is tftpd running on SunOS 5.10. – Denis Sep 25 '14 at 08:54

2 Answers


Pipe it through stdin:

 wget -O - ftp://user:password@ftpserver/absolute_path_to_file | hadoop fs -put - path_in_hdfs

The -O - option makes wget write the download to stdout instead of a local file, and the single - tells hadoop fs -put to read from stdin.
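As a usage sketch (with the same placeholder paths as the question), it helps to silence wget's progress output and to sanity-check the result afterwards:

    wget -q -O - ftp://user:password@ftpserver/absolute_path_to_file \
      | hadoop fs -put - path_in_hdfs
    # The size reported here should match the file on the FTP server.
    hadoop fs -ls path_in_hdfs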

J Maurer

hadoop fs -cp ftp://user:password@ftpserver.com/absolute_path_to_file path_in_hdfs

This cannot be used here: hadoop fs -cp expects the source file to be in the local file system and does not take into account the scheme you are trying to pass. Refer to the javadoc: FileSystem

distcp is only for large intra- or inter-cluster copies (cluster to be read as a Hadoop cluster, i.e. HDFS). Again, it cannot get data from FTP. The two-step process is still your best bet (a rough sketch follows), or write a program that reads from FTP and writes to HDFS.
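A minimal sketch of that two-step fallback, with the same placeholder paths as above and a temporary file that is cleaned up afterwards:

    # Stage the file on the local box, load it into HDFS, then drop the intermediate copy.
    TMP=$(mktemp)
    wget -q -O "$TMP" ftp://user:password@ftpserver/absolute_path_to_file
    hadoop fs -copyFromLocal "$TMP" path_in_hdfs
    rm -f "$TMP"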

Venkat
  • The source is on FTP. I tried hadoop fs -cp from FTP on a different cluster and it worked, so it is a valid option. distcp also started working, but failed with a memory exception, so I am not sure about that one. – Denis Sep 24 '14 at 07:46
  • No way. Have you seen the javadoc or the source? I would advise you to try that. – Venkat Sep 24 '14 at 18:42
  • Venkat, thank you for your help, but I tried it from a public FTP server and it worked. – Denis Sep 25 '14 at 08:52