
I am trying to access HDFS in the Hadoop Sandbox through the Java API from a Spring Boot application. The URI used to access the filesystem is specified via the configuration parameter spring.hadoop.fsUri. HDFS itself is protected by Apache Knox (which, to me, should act just as a proxy that handles authentication). When I call the proxy URI with curl, I use the exact same semantics as I would without Apache Knox. Example:

curl -k -u guest:guest-password https://sandbox.hortonworks.com:8443/gateway/knox_sample/webhdfs/v1?op=GETFILESTATUS

The problem is that I can't access this gateway using the Hadoop client library. The root URL in the configuration parameter is:

spring.hadoop.fsUri=swebhdfs://sandbox.hortonworks.com:8443/gateway/knox_sample/webhdfs/v1
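
For reference, a minimal plain Hadoop API sketch that reproduces the same behaviour outside Spring Boot (the URI is the one from the question; the class and the getFileStatus call are illustrative assumptions, not Spring for Apache Hadoop's actual wiring):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class KnoxHdfsClient {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Same value as spring.hadoop.fsUri above; the
            // /gateway/knox_sample/webhdfs/v1 path is what gets lost.
            URI fsUri = URI.create(
                "swebhdfs://sandbox.hortonworks.com:8443/gateway/knox_sample/webhdfs/v1");
            FileSystem fs = FileSystem.get(fsUri, conf);
            // Triggers the GETFILESTATUS request seen in the log below
            // and fails with a 404 through Knox.
            System.out.println(fs.getFileStatus(new Path("/")));
        }
    }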

All requests fail with Error 404, and the logs show why:

2015-11-19 16:42:15.058 TRACE 26476 --- [nio-8090-exec-9] o.a.hadoop.hdfs.web.WebHdfsFileSystem    : url=https://sandbox.hortonworks.com:8443/webhdfs/v1/?op=GETFILESTATUS&user.name=tarmo

It destroys my originally provided fsUri. Debugging the internals of the Hadoop API, I see that it takes only the authority part, sandbox.hortonworks.com:8443, and appends /webhdfs/v1/ to it from a constant. So whatever my original URI is, in the end it becomes https://my-provided-hostname/webhdfs/v1. I understand this might have something to do with the swebhdfs:// scheme, but I can't use https:// directly because then an exception is thrown saying there is no such filesystem as https.
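
For illustration, here is roughly how the URL gets rebuilt (a paraphrase of the client's behaviour, not the exact Hadoop source; "/webhdfs/v1" is the constant WebHdfsFileSystem.PATH_PREFIX):

    import java.net.URI;
    import java.net.URL;

    public class UrlRebuildDemo {
        public static void main(String[] args) throws Exception {
            URI fsUri = URI.create(
                "swebhdfs://sandbox.hortonworks.com:8443/gateway/knox_sample/webhdfs/v1");
            // Only host and port are taken from fsUri; the path is
            // replaced by the hard-coded "/webhdfs/v1" prefix.
            URL url = new URL("https", fsUri.getHost(), fsUri.getPort(),
                "/webhdfs/v1" + "/" + "?op=GETFILESTATUS");
            // Prints https://sandbox.hortonworks.com:8443/webhdfs/v1/?op=GETFILESTATUS
            // -- the /gateway/knox_sample prefix from fsUri.getPath() is never used.
            System.out.println(url);
        }
    }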

Googling this, I found an old mailing list thread where someone had the same problem, but no one ever answered the poster.

Does anyone know what can be done to solve this problem?

Tarmo
  • Just for information to readers of this question: since I did not find any way to get past this behaviour using the Hadoop API, I implemented the few interactions I had with HDFS using the Apache HTTP Client and Spring's RestTemplate (a sketch of that approach follows below). – Tarmo Nov 25 '15 at 15:36
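
A minimal sketch of the kind of workaround described in the comment above, assuming the same basic-auth credentials and gateway URL as the curl example (class and method names are hypothetical):

    import java.nio.charset.StandardCharsets;
    import java.util.Base64;
    import org.springframework.http.HttpEntity;
    import org.springframework.http.HttpHeaders;
    import org.springframework.http.HttpMethod;
    import org.springframework.http.ResponseEntity;
    import org.springframework.web.client.RestTemplate;

    public class KnoxWebHdfsClient {

        private static final String BASE =
            "https://sandbox.hortonworks.com:8443/gateway/knox_sample/webhdfs/v1";

        public static void main(String[] args) {
            RestTemplate rest = new RestTemplate();

            HttpHeaders headers = new HttpHeaders();
            // Same basic-auth credentials as the curl example.
            String token = Base64.getEncoder().encodeToString(
                "guest:guest-password".getBytes(StandardCharsets.UTF_8));
            headers.set("Authorization", "Basic " + token);

            // GETFILESTATUS on the root directory, exactly like the curl call.
            // Note: unlike curl -k, this assumes the gateway's TLS certificate
            // is trusted by the JVM (otherwise the SSL context must be customized).
            ResponseEntity<String> resp = rest.exchange(
                BASE + "/?op=GETFILESTATUS",
                HttpMethod.GET, new HttpEntity<Void>(headers), String.class);

            System.out.println(resp.getBody());
        }
    }

The key point of this approach is that the full gateway URL, including the /gateway/knox_sample prefix, stays under the caller's control.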

1 Answer


I apologize for being so late in this response.

You may be able to leverage the Apache Knox Default Topology URL. In your description, you happen to be using a topology called knox_sample. In order to access that topology as the "Default Topology", you would have to configure it as the default topology name. See: http://knox.apache.org/books/knox-0-7-0/user-guide.html#Default+Topology+URLs

The default "Default Topology" name is sandbox.
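
For context, a hedged example of what that could look like in the gateway's conf/gateway-site.xml, assuming the default.app.topology.name property described in the linked user guide:

    <!-- conf/gateway-site.xml on the Knox gateway host -->
    <property>
        <name>default.app.topology.name</name>
        <value>knox_sample</value>
    </property>

With that in place, a URL of the form https://sandbox.hortonworks.com:8443/webhdfs/v1/... (the form the Hadoop client actually produces) would be routed to the knox_sample topology.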

lmccay
  • If I understand you correctly, your solution would change the URL on the Knox side so that it would look like a usual WebHDFS URL and thus work with the official API. But that solution does not work for me, because another party owns Knox/Hadoop and I have no access to the configuration of those components. – Tarmo Jan 18 '16 at 18:11
  • That is an interesting situation... You could consider another Knox instance that you control and use the Default Topology, which would in turn rewrite the URL again to the expected one when dispatching to the original Knox instance. It isn't at all clear to me how you are currently authenticating to the original Knox instance, but if you are using basic auth as you do in the curl example, that would probably not work out of the box. You would need a new dispatch that sent basic credentials, and you'd need to use the hadoop-auth provider in the local Knox for the pseudo-auth "user.name" parameter. – lmccay Jan 19 '16 at 17:53
  • Thank you for your input. That all might work, but I took the easier path. Since I only needed a few trivial and straightforward HDFS interactions, I made my own small API for them on top of Spring's RestTemplate/Apache Commons HTTP. This way I can choose whatever base URL I like and don't have to set up layers of environments to make it work. It has worked well for me so far. If at some point my solution no longer serves me well enough, I'll look more into that. But again, thank you. – Tarmo Jan 20 '16 at 08:10