I am trying to access HDFS on the Hortonworks Sandbox through the Java API from a Spring Boot application. To specify the filesystem URI, I use the configuration parameter spring.hadoop.fsUri. HDFS itself is protected by Apache Knox (which, as I understand it, should act just as a proxy that handles authentication). So when I call the proxy URI with curl, I use the exact same semantics as I would without Apache Knox. Example:
curl -k -u guest:guest-password https://sandbox.hortonworks.com:8443/gateway/knox_sample/webhdfs/v1?op=GETFILESTATUS
The problem is that I can't access this gateway using the Hadoop client library. The root URL in the configuration parameter is:
spring.hadoop.fsUri=swebhdfs://sandbox.hortonworks.com:8443/gateway/knox_sample/webhdfs/v1
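For clarity, this is how the configured URI breaks down into components (plain java.net.URI, nothing Hadoop-specific; note that the Knox gateway context path lives entirely in the path component):

```java
import java.net.URI;

public class FsUriParts {
    public static void main(String[] args) {
        URI fsUri = URI.create(
            "swebhdfs://sandbox.hortonworks.com:8443/gateway/knox_sample/webhdfs/v1");
        System.out.println(fsUri.getScheme());    // swebhdfs
        System.out.println(fsUri.getAuthority()); // sandbox.hortonworks.com:8443
        System.out.println(fsUri.getPath());      // /gateway/knox_sample/webhdfs/v1
    }
}
```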
All requests fail with a 404 error, and the reason is visible in the logs:
2015-11-19 16:42:15.058 TRACE 26476 --- [nio-8090-exec-9] o.a.hadoop.hdfs.web.WebHdfsFileSystem : url=https://sandbox.hortonworks.com:8443/webhdfs/v1/?op=GETFILESTATUS&user.name=tarmo
It destroys my originally provided fsUri. Debugging the internals of the Hadoop API, I see that it takes only the authority part, sandbox.hortonworks.com:8443, and appends /webhdfs/v1/ to it from a constant. So whatever my original URI is, in the end it becomes https://my-provided-hostname/webhdfs/v1. I understand this might have something to do with the swebhdfs:// scheme, but I can't use https:// directly, because in that case an exception is thrown saying there is no such filesystem as https.
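The behavior I observe can be illustrated with plain java.net.URI: the client keeps only the scheme and authority of my configured URI and substitutes its own hard-coded path prefix. This is my reading of the log output above, sketched in stdlib Java, not the actual Hadoop source:

```java
import java.net.URI;

public class EffectiveWebHdfsUrl {
    public static void main(String[] args) {
        // The fsUri I configure, including the Knox gateway context path
        URI fsUri = URI.create(
            "swebhdfs://sandbox.hortonworks.com:8443/gateway/knox_sample/webhdfs/v1");

        // What the client appears to do: keep only the authority, drop the
        // /gateway/knox_sample prefix, and prepend a constant "/webhdfs/v1"
        String effective = "https://" + fsUri.getAuthority()
                + "/webhdfs/v1" + "/?op=GETFILESTATUS";

        System.out.println(effective);
        // → https://sandbox.hortonworks.com:8443/webhdfs/v1/?op=GETFILESTATUS
    }
}
```

This matches the URL in the TRACE log line above (minus the user.name parameter), which is why every request misses the Knox gateway context and returns 404.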
Googling this, I found an old mailing-list thread where someone had the same problem, but nobody ever answered the poster.
Does anyone know what can be done to solve this problem?