
We have an application that downloads files from an FTP server. We are planning to improve its efficiency by using Map Reduce to download the files from FTP. My first question is: is it actually possible to improve efficiency using Map Reduce? Our reasoning is that a number of mappers, each with its own read channel, would make the download faster by running it in parallel. But we are not sure of the technical roadblocks, if any. Any pointers?

RadAl

1 Answer


If you are expecting to improve your download speed by using Map Reduce, that is not going to help much.
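To illustrate that parallel downloading itself does not need a cluster, here is a minimal sketch in Python that fetches several files concurrently from a single machine, one FTP connection per worker. The function names, the `fetch` parameter, and the argument layout are assumptions made for this example. Whether it ends up any faster depends on the bandwidth of the server and the network, not on the number of workers.

```python
from concurrent.futures import ThreadPoolExecutor
from ftplib import FTP
from pathlib import Path


def download_one(host, user, password, remote_path, local_dir):
    """Fetch a single file over its own FTP connection (one "read channel")."""
    local_path = Path(local_dir) / Path(remote_path).name
    with FTP(host) as ftp:
        ftp.login(user, password)
        with open(local_path, "wb") as out:
            # retrbinary streams the file and calls out.write for each block
            ftp.retrbinary(f"RETR {remote_path}", out.write)
    return local_path


def download_all(host, user, password, remote_paths, local_dir,
                 workers=4, fetch=download_one):
    """Download files in parallel; results are returned in input order.

    `fetch` is a hypothetical hook so the download step can be swapped out,
    for example with a stub when no FTP server is available.
    """
    Path(local_dir).mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(fetch, host, user, password, path, local_dir)
            for path in remote_paths
        ]
        # result() re-raises any exception raised inside a worker
        return [future.result() for future in futures]
```

A call might look like `download_all("ftp.example.com", "user", "secret", ["/pub/a.csv", "/pub/b.csv"], "downloads")`, where the host, credentials, and paths are placeholders. Many FTP servers also limit the number of simultaneous connections per client, so `workers` may need to stay small.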

What Map Reduce is good for is something like this: you have 100 GB worth of files and you want to process them and efficiently count the occurrences of a particular word. But even then, Map Reduce can't work directly on files sitting on an FTP server. For Map Reduce to work, the files need to be available in the Hadoop Distributed File System (HDFS).
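The word count example can be sketched in plain Python to show the shape of the computation. This is not the Hadoop API: the `chunks` list stands in for blocks of a file that is already stored in HDFS, and `map_count` and `reduce_counts` are names made up for this illustration.

```python
from functools import reduce


def map_count(chunk, word):
    """Map step: count occurrences of `word` in one chunk of text."""
    target = word.lower()
    return sum(1 for token in chunk.lower().split() if token == target)


def reduce_counts(counts):
    """Reduce step: add the per-chunk counts together."""
    return reduce(lambda a, b: a + b, counts, 0)


# Each string stands in for one block of a large file (illustrative data).
chunks = [
    "Hadoop stores files in HDFS",
    "hdfs splits files into blocks",
    "no match here",
]
total = reduce_counts(map_count(chunk, "hdfs") for chunk in chunks)
print(total)  # prints 2
```

In a real job each map call would run on the node that holds the block, which is why the data has to be in HDFS first.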

To understand what Hadoop is and isn't, read this post.

shazin
  • Thanks, Shazin. Please clarify: "But even for that Map Reduce just can't straight away work on top of files available in FTP. In order for Map Reduce to work you need the file to be available in Hadoop Distributed File System (HDFS)." Does it mean that there is no way Map Reduce can work on files on the FTP server? – RadAl Nov 21 '12 at 08:42
  • Yes. Map Reduce requires files to be downloaded and "put" into HDFS before they can be used for processing. It cannot operate directly on files on an FTP server. – shazin Nov 21 '12 at 10:43
  • Thanks again. But I was hoping for something along the lines of the DBInputFormat class, which reads data with MapReduce directly from a database. It doesn't bring the data into HDFS before performing a MapReduce read. Any idea if there is something like that for a file reader? – RadAl Nov 23 '12 at 05:02
  • Yes, DBInputFormat lets you feed a database table to a Map Reduce program without having it in HDFS. But in that case the data is read from the database (via a ResultSet). It is not the same for FTP: you have to download the file in order to read it anyway. Streaming FTP files may be possible, but it will run into bandwidth problems. – shazin Nov 24 '12 at 07:07