
I have an FTP server (F), a standalone Linux box (S) and a Hadoop cluster (C). The current file flow is F->S->C. I am trying to improve performance by skipping S.

The current flow is:

wget ftp://user:password@ftpserver/absolute_path_to_file
hadoop fs -copyFromLocal path_to_file path_in_hdfs

I tried:

hadoop fs -cp ftp://user:password@ftpserver/absolute_path_to_file path_in_hdfs

and:

hadoop distcp ftp://user:password@ftpserver/absolute_path_to_file path_in_hdfs

Both hang. The distcp one, being a MapReduce job, is killed by a timeout. The logs (hadoop job -logs) only say it was killed by the timeout. I tried to wget from the FTP server on one of the nodes of C and it worked. What could be the reason, and is there any hint on how to figure it out?
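Update: following my comments below, I am also checking whether every node of C can reach the FTP server, since the distcp mappers may run on any of them. A rough sketch of such a check (the node names are placeholders):

    # wget --spider only checks that the file is reachable; it does not download it.
    for node in node1 node2 node3; do
      ssh "$node" "wget --spider ftp://user:password@ftpserver/absolute_path_to_file" \
        && echo "$node OK" || echo "$node FAILED"
    done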

Denis
  • Latest findings: 1. the distcp map phase reaches 100% and it hangs at reduce 0%, but finally it prints that the map was cancelled by a timeout; 2. distcp -log /hdfspath is for some reason empty; 3. I am able to fs -cp and distcp from a public repository (Mozilla's) to the same cluster. I am investigating: 1. whether all nodes of the cluster have access to the FTP server I am trying to copy from; 2. known issues of the FTP server. – Denis Sep 24 '14 at 10:49
  • Even more info: fs -put and distcp create a path_in_hdfs/filename._COPYING_ file of the correct size. – Denis Sep 24 '14 at 11:05
  • More info: the server is tftpd running on SunOS 5.10. – Denis Sep 25 '14 at 08:54

2 Answers


Pipe it through stdin:

 wget -O - ftp://user:password@ftpserver/absolute_path_to_file | hadoop fs -put - path_in_hdfs

The -O - option makes wget write the download to stdout instead of a local file, and the single - tells hadoop fs -put to read from stdin.
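As a usage sketch (with the same placeholder paths as the question), it helps to silence wget's progress output and to sanity-check the result afterwards:

    wget -q -O - ftp://user:password@ftpserver/absolute_path_to_file \
      | hadoop fs -put - path_in_hdfs
    # The size reported here should match the file on the FTP server.
    hadoop fs -ls path_in_hdfs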

J Maurer

hadoop fs -cp ftp://user:password@ftpserver.com/absolute_path_to_file path_in_hdfs

This cannot be used here: hadoop fs -cp expects the source file to be in the local file system and does not take into account the scheme you are trying to pass. Refer to the javadoc: FileSystem

distcp is only for large intra- or inter-cluster copies (cluster to be read as a Hadoop cluster, i.e. HDFS). Again, it cannot get data from FTP. The two-step process is still your best bet (a rough sketch follows), or write a program that reads from FTP and writes to HDFS.
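A minimal sketch of that two-step fallback, with the same placeholder paths as above and a temporary file that is cleaned up afterwards:

    # Stage the file on the local box, load it into HDFS, then drop the intermediate copy.
    TMP=$(mktemp)
    wget -q -O "$TMP" ftp://user:password@ftpserver/absolute_path_to_file
    hadoop fs -copyFromLocal "$TMP" path_in_hdfs
    rm -f "$TMP"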

Venkat
  • The source is on FTP. I tried hadoop fs -cp from FTP on a different cluster and it worked, so it is a valid option. distcp also started working, but failed with a memory exception, so I am not sure about that one. – Denis Sep 24 '14 at 07:46
  • No way. Have you seen the javadoc or the source? I would advise you to try that. – Venkat Sep 24 '14 at 18:42
  • Venkat, thank you for your help, but I tried it from a public FTP server and it worked. – Denis Sep 25 '14 at 08:52