
I need to transfer data fairly regularly (on demand, not scripted/streamed) between two independent Hadoop clusters, one of which is deployed in an isolated network and has no direct access to the other.

I tried searching the official documentation and the web for answers, but it seems to be a rather non-trivial task to accomplish. The only answers I found relate to proxying the REST service.

Is there a way to proxy distcp functionality?

Or maybe there is some other efficient (and scalable?) way to transfer data between two isolated Hadoop clusters via some kind of temporary storage?

Oneiroi

1 Answer


What you could do is set up an HDFS NFS gateway service on each cluster and then, on an intermediate host, mount the two shares (one from each cluster). You could then quite easily copy files back and forth as a user on that intermediate host. The firewall on the intermediate host would need to be locked down to keep the setup secure. Beware that the HDFS audit logs do not understand the NFS gateway's way of doing things, so the logs will only show a generic NFS user acting on the filesystems, not the real users.
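Here is a minimal sketch of the copy step on the intermediate host, assuming both gateways are already exported and mounted at /mnt/clusterA and /mnt/clusterB (hypothetical paths; adjust to your mounts):

```python
#!/usr/bin/env python3
"""Copy a directory tree between two HDFS NFS gateway mounts on the
intermediate host. Mount points below are hypothetical examples,
e.g. mounted with:  mount -t nfs -o vers=3,nolock <gateway>:/ /mnt/clusterA
"""
import shutil
from pathlib import Path

SRC = Path("/mnt/clusterA/data/export")   # share from the source cluster
DST = Path("/mnt/clusterB/data/import")   # share from the destination cluster

def transfer(src: Path, dst: Path) -> None:
    """Recursively copy src into dst, creating directories as needed."""
    dst.mkdir(parents=True, exist_ok=True)
    for item in src.rglob("*"):
        target = dst / item.relative_to(src)
        if item.is_dir():
            target.mkdir(parents=True, exist_ok=True)
        else:
            shutil.copy2(item, target)  # sequential copy of data + timestamps

if __name__ == "__main__":
    transfer(SRC, DST)
```

Plain sequential copies like this work well here because, as far as I know, the HDFS NFS gateway only supports sequential writes, not random ones.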

HTH

tonyalbers