
I have a local machine (local_user@local_machine), and a Hadoop file system (HDFS) on a different server (some_user@another_server). One of the users on the Hadoop server is named target_user. How do I access target_user's files from local_user@local_machine? More precisely, say there's a file /user/target_user/test.txt in the HDFS on some_user@another_server. What is the correct file path I should use when accessing /user/target_user/test.txt from local_user@local_machine?

I can access the file on the HDFS server itself with hdfs dfs -cat /user/target_user/test.txt. But I can't access it from my local machine using a Python script I have written to read from and write to the HDFS (it takes three arguments: local file path, remote file path, and read or write), most probably because I am not giving the correct path.
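For reference, the connection check that the traceback below points at, reconstructed from the stack trace, is essentially this (`name` is the NameNode address I export beforehand, as mentioned in my comment further down):

import requests

# Reconstructed from the traceback: check the NameNode status via its JMX
# endpoint. `name` is a placeholder for the exported NameNode address.
name = "http://another_server:50470/webhdfs/v1/"
request = requests.get(
    "%s/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus" % name,
    verify=False,
).json()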

I have tried the following, but none of them work:

$ #local_user@local_machine

$ python3 rw_hdfs.py ./to_local_test.txt /user/target_user/test.txt read

$ python3 rw_hdfs.py ./to_local_test.txt some_user@another_server/user/target_user/test.txt read

They all give the exact same error:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 377, in _make_request
    httplib_response = conn.getresponse(buffering=True)
TypeError: getresponse() got an unexpected keyword argument 'buffering'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 560, in urlopen
    body=body, headers=headers)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 379, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.5/http/client.py", line 1197, in getresponse
    response.begin()
  File "/usr/lib/python3.5/http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.5/http/client.py", line 279, in _read_status
    raise BadStatusLine(line)
http.client.BadStatusLine: 


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 376, in send
    timeout=timeout
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 610, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 247, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/lib/python3/dist-packages/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 560, in urlopen
    body=body, headers=headers)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 379, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.5/http/client.py", line 1197, in getresponse
    response.begin()
  File "/usr/lib/python3.5/http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.5/http/client.py", line 279, in _read_status
    raise BadStatusLine(line)
requests.packages.urllib3.exceptions.ProtocolError: ('Connection aborted.', BadStatusLine('\x15\x03\x03\x00\x02\x02\n',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "python_hdfs.py", line 63, in <module>
    status, name, nnaddress= check_node_status(node)
  File "python_hdfs.py", line 18, in check_node_status
    request = requests.get("%s/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus"%name,verify=False).json()
  File "/usr/lib/python3/dist-packages/requests/api.py", line 67, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/api.py", line 53, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 426, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', BadStatusLine('\x15\x03\x03\x00\x02\x02\n',))
Kristada673

1 Answer


More precisely, say there's a file /user/target_user/test.txt present in the HDFS on some_user@another_server

First, HDFS isn't a single directory on one machine; it's a distributed file system spread across the cluster, so addressing it with an SSH-style user@host path doesn't make sense.

Secondly, whatever Python library you're using is trying to communicate over WebHDFS, which you must specifically enable for the cluster.

https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
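If WebHDFS is enabled and the cluster is not secured, a read is a plain REST call against the NameNode. A minimal sketch (the host and port are placeholders; 50070 is the default HTTP port):

import requests

# Minimal WebHDFS read, assuming WebHDFS is enabled and the cluster is NOT
# Kerberized. The NameNode host/port below are placeholders.
namenode = "http://another_server:50070"
resp = requests.get(
    "%s/webhdfs/v1/user/target_user/test.txt" % namenode,
    params={"op": "OPEN", "user.name": "target_user"},
    allow_redirects=True,  # the NameNode redirects the actual read to a DataNode
)
resp.raise_for_status()
print(resp.text)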

The BadStatusLine in the error (the \x15\x03\x03... bytes are a raw TLS alert, meaning the server answered your plain-HTTP request with a TLS record) suggests you're dealing with a Kerberized, secure cluster, so you might need a different way to read files.

For example, PySpark or the Ibis project.
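A rough sketch of the PySpark route, assuming Spark is installed on a machine that can reach the cluster and you have a valid Kerberos ticket (kinit has been run); the path is the one from the question:

from pyspark.sql import SparkSession

# Sketch only: needs Spark configured against this cluster and a valid
# Kerberos ticket (run kinit first). The path comes from the question.
spark = SparkSession.builder.appName("read-hdfs-test").getOrCreate()
for line in spark.sparkContext.textFile("hdfs:///user/target_user/test.txt").take(10):
    print(line)
spark.stop()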

OneCricketeer
  • In my Python code, I have this line to connect to the HDFS: `request = requests.get("%s/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus"%name,verify=False).json()`. And I have previously exported `namenode` as `http://another_server:50470/webhdfs/v1/`. So, according to the link you provided above, it's following the REST API syntax. But it still cannot connect to the HDFS. Why is that? I have also tried exporting `namenode` in the webHDFS syntax as `webhdfs://another_server:50470/target_user/`, but that doesn't work at all; it gives an `invalid schema` error. – Kristada673 Feb 12 '18 at 08:23
  • And the second thing is, yes, it's Kerberized, and the code works on the Kerberized HDFS we have in my office, but for some reason it does not work on the client's Kerberized HDFS. – Kristada673 Feb 12 '18 at 10:17
  • You're using Requests, which speaks the `http://` protocol, so `webhdfs://` there is an "unknown schema", as it says. I have very little experience with Kerberized clusters, but I see that SPNEGO support in requests is not complete. https://github.com/requests/requests-kerberos/pull/89 and https://github.com/requests/requests-kerberos/issues/90 – OneCricketeer Feb 12 '18 at 17:49
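For completeness, this is roughly what a SPNEGO-authenticated WebHDFS read with requests-kerberos would look like, assuming a valid ticket from kinit and WebHDFS served over HTTPS on port 50470; whether it actually works against this cluster depends on the SPNEGO issues linked above:

import requests
from requests_kerberos import HTTPKerberosAuth, OPTIONAL

# Assumes `kinit` has been run and WebHDFS is served over HTTPS on 50470.
# verify=False skips certificate validation, as in the original script.
namenode = "https://another_server:50470"
resp = requests.get(
    "%s/webhdfs/v1/user/target_user/test.txt" % namenode,
    params={"op": "OPEN"},
    auth=HTTPKerberosAuth(mutual_authentication=OPTIONAL),
    verify=False,
)
print(resp.text)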