
I need some help. I am downloading a file from a webpage using Python code, saving it to the local file system, transferring it into HDFS with the put command, and then performing operations on it.

However, in some situations the file will be very large, and downloading it to the local file system first is not the right approach. I want the file to be downloaded directly into HDFS, without using the local file system at all.

Can anyone suggest which method would be the best way to proceed? If there are any errors in my question, please correct me.

Brian Tompsett - 汤莱恩
Rahul

1 Answer


You can pipe it directly from a download to avoid writing it to disk, e.g.:

curl server.com/my/file | hdfs dfs -put - destination/file

The - parameter to -put tells it to read from stdin (see the documentation).
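Since the question mentions downloading from Python, the same pattern can be reproduced there by streaming the HTTP response into the stdin of an hdfs dfs -put - subprocess. A minimal sketch, where the URL and HDFS destination are placeholders:

    import shutil
    import subprocess
    import urllib.request

    url = "http://server.com/my/file"       # placeholder URL
    hdfs_dest = "destination/file"          # placeholder HDFS path

    # Start `hdfs dfs -put - <dest>` and copy the HTTP response into its stdin,
    # so the downloaded bytes are never written to the local file system.
    with urllib.request.urlopen(url) as response:
        put = subprocess.Popen(["hdfs", "dfs", "-put", "-", hdfs_dest],
                               stdin=subprocess.PIPE)
        shutil.copyfileobj(response, put.stdin)
        put.stdin.close()
        if put.wait() != 0:
            raise RuntimeError("hdfs dfs -put exited with a non-zero status")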

This will still route the download through your local machine, though, just not through your local file system. If you want to download the file without using your local machine at all, you can write a map-only MapReduce job whose tasks accept e.g. an input file containing a list of files to be downloaded and then download them and stream out the results. Note that this will require your cluster to have open access to the internet which is generally not desirable.
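One possible way to set that up is with Hadoop Streaming: give the job a text file of URLs (one per line) and a map-only mapper script that downloads each URL on the worker node and pipes it straight into HDFS. This is only a rough sketch; the destination naming and paths are assumptions, not something from the answer:

    #!/usr/bin/env python
    # mapper.py -- reads one URL per line from stdin, downloads it, and
    # streams the bytes into HDFS via `hdfs dfs -put -` on the worker node.
    import os
    import shutil
    import subprocess
    import sys
    import urllib.request

    for line in sys.stdin:
        url = line.strip()
        if not url:
            continue
        # Derive a destination name from the URL; adjust to your own layout.
        dest = "downloads/" + os.path.basename(url)
        with urllib.request.urlopen(url) as response:
            put = subprocess.Popen(["hdfs", "dfs", "-put", "-", dest],
                                   stdin=subprocess.PIPE)
            shutil.copyfileobj(response, put.stdin)
            put.stdin.close()
            put.wait()
        # Emit a status line so the job produces some output to inspect.
        print("%s\t%d" % (url, put.returncode))

It would be submitted with the Hadoop Streaming jar and zero reducers (something like hadoop jar hadoop-streaming-*.jar -D mapreduce.job.reduces=0 -input urls.txt -output download-status -mapper mapper.py -file mapper.py), and it carries the caveat from the answer: every worker node needs outbound internet access.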

xkrogen
  • Will this download the file to the local system or not? – Rahul Dec 05 '17 at 17:18
  • The command I supplied will not download anything to your local file system, however it _will_ download via your local machine's network, then re-upload it to HDFS. It just will not write it to the file system in the meantime. Not sure if this is what you were looking for. I also described how you can do this cutting out the local machine entirely. – xkrogen Dec 06 '17 at 17:24