
In Hadoop, is there any limit to the size of data that can be accessed from, or ingested into, HDFS through Knox + WebHDFS?

Satheesha

2 Answers


Apache Knox is your best option when you need to access WebHDFS resources from outside of a cluster that is protected by firewalls. If you don't have access to all of the datanode ports, then direct access to WebHDFS will not work for you. Opening firewall holes for all of those host:port pairs defeats the purpose of the firewall, creates a management nightmare, and needlessly leaks network details to external clients.

As Hellmar indicated, it depends on your specific use cases and clients. If you need to ingest huge files, or huge numbers of files, then you may want to consider a different approach to accessing the cluster internals for those clients. If you merely need access to files of any size, then you should be able to extend that access to many clients.

Not having to authenticate using kerberos/SPNEGO to access such resources opens up many possible clients that would otherwise be unusable with secure clusters.

The Knox user's guide has examples for accessing WebHDFS resources; you can find them at http://knox.apache.org/books/knox-0-7-0/user-guide.html#WebHDFS. That section also illustrates the Groovy-based scripting available from Knox, which allows you to do some really interesting things.
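To make the URL mapping concrete, here is a minimal sketch of how a Knox-proxied WebHDFS request is formed. The gateway host, topology name (`default`), file path, and the `guest` credentials are all hypothetical placeholders, not values from this cluster:

```python
# Sketch: building a Knox-proxied WebHDFS URL.
# Knox exposes WebHDFS under /gateway/<topology>/webhdfs/v1 on the gateway
# host, so the client never needs to reach the namenode or datanodes directly.
from urllib.parse import quote

def knox_webhdfs_url(gateway, topology, path, op):
    """Return the Knox gateway URL for a WebHDFS operation on an HDFS path."""
    return f"https://{gateway}/gateway/{topology}/webhdfs/v1{quote(path)}?op={op}"

url = knox_webhdfs_url("knox.example.com:8443", "default", "/tmp/demo.txt", "OPEN")
print(url)
# The resulting URL can then be fetched with any HTTP client, e.g.:
#   curl -ik -u guest:guest-password "<url>"
```

The same pattern covers the other WebHDFS operations (`LISTSTATUS`, `CREATE`, and so on); only the `op` parameter and HTTP method change.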

lmccay

In theory, there is no limit. However, using Knox creates a bottleneck. Pure WebHDFS would redirect the read/write request for each block to a (possibly) different datanode, parallelizing access; but with Knox everything is routed through a single gateway and serialized.
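The bottleneck argument can be made concrete with a toy back-of-the-envelope model. The 128 MB block size and 100 MB/s per-link throughput below are illustrative numbers only, and the model assumes a client that fetches blocks in parallel in the direct case, as described above:

```python
# Toy model: direct WebHDFS can fetch each block from its own datanode in
# parallel, while Knox funnels every block through one gateway link.
# Block size and link throughput are made-up illustrative figures.
BLOCK_MB = 128
LINK_MB_PER_S = 100

def direct_webhdfs_seconds(num_blocks, num_datanodes):
    # Blocks spread evenly across datanodes are fetched concurrently;
    # the busiest datanode sets the total time.
    blocks_on_busiest = -(-num_blocks // num_datanodes)  # ceiling division
    return blocks_on_busiest * BLOCK_MB / LINK_MB_PER_S

def knox_seconds(num_blocks):
    # Every block is serialized through the single gateway link.
    return num_blocks * BLOCK_MB / LINK_MB_PER_S

# A 10 GB file (80 blocks of 128 MB) on an 8-datanode cluster:
print(direct_webhdfs_seconds(80, 8))  # 12.8 s in the ideal parallel case
print(knox_seconds(80))               # 102.4 s through the gateway
```

The absolute numbers are meaningless; the point is that the Knox path grows linearly with file size regardless of cluster size, which is why large transfers feel it first.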

That being said, you would probably not want to upload a huge file using Knox and WebHDFS. It will simply take too long (and, depending on your client, you may hit a timeout).

Hellmar Becker
  • Thank you, Hellmar, for the reply. Is there any alternative to Knox that I can use to access the data (not for upload) via WebHDFS in a secure way? – Satheesha Sep 22 '15 at 08:56
  • You can use WebHDFS over HTTPS, and secure it with Kerberos and SPNEGO. The downside is that you will need a Kerberos client on each machine that needs that kind of access. – Hellmar Becker Sep 22 '15 at 09:02