
I have a cluster with Hadoop installed:

hadoop version
Hadoop 3.1.1.3.0.1.0-187
Source code repository git@github.com:hortonworks/hadoop.git -r 2820e4d6fc7ec31ac42187083ed5933c823e9784
Compiled by jenkins on 2018-09-19T10:19Z
Compiled with protoc 2.5.0
From source with checksum 889327faf5a6ca5fc06fcf97c13af29
This command was run using /usr/hdp/3.0.1.0-187/hadoop/hadoop-common-3.1.1.3.0.1.0-187.jar

I also have Python 3 installed, as well as the Dask package (https://github.com/dask/dask), installed from source.

I tried the following code:

import dask
import dask.dataframe as dd

dask.config.set({"hdfs_driver": "pyarrow"})
df = dd.read_csv('hdfs://master01.myserver.ru:8020/data/batch/82.csv')
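
For context, with the hdfs_driver option set to "pyarrow", Dask builds a pyarrow.hdfs.HadoopFileSystem under the hood (visible in the traceback below), so the same connection can be attempted with pyarrow directly to isolate the failing layer. A minimal sketch, with the host and port taken from the URL above:

import pyarrow as pa

# connect() goes through the same libhdfs / classpath machinery that the
# Dask call above ends up triggering.
fs = pa.hdfs.connect("master01.myserver.ru", 8020)
print(fs.ls("/data/batch"))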

I'm sure this file exists (I checked with hadoop fs -ls /data/batch). I also tried PySpark and it works (I can read this CSV with it). But with Dask I get the following error:

Traceback (most recent call last):
  File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ml/py3-env/lib64/python3.6/site-packages/ptvsd/__main__.py", line 269, in <module>
    main()
  File "/home/ml/py3-env/lib64/python3.6/site-packages/ptvsd/__main__.py", line 265, in main
    wait=args.wait)
  File "/home/ml/py3-env/lib64/python3.6/site-packages/ptvsd/__main__.py", line 258, in handle_args
    debug_main(addr, name, kind, *extra, **kwargs)
  File "/home/ml/py3-env/lib64/python3.6/site-packages/ptvsd/_local.py", line 45, in debug_main
    run_file(address, name, *extra, **kwargs)
  File "/home/ml/py3-env/lib64/python3.6/site-packages/ptvsd/_local.py", line 79, in run_file
    run(argv, addr, **kwargs)
  File "/home/ml/py3-env/lib64/python3.6/site-packages/ptvsd/_local.py", line 140, in _run
    _pydevd.main()
  File "/home/ml/py3-env/lib64/python3.6/site-packages/ptvsd/_vendored/pydevd/pydevd.py", line 1934, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/home/ml/py3-env/lib64/python3.6/site-packages/ptvsd/_vendored/pydevd/pydevd.py", line 1283, in run
    return self._exec(is_module, entry_point_fn, module_name, file, globals, locals)
  File "/home/ml/py3-env/lib64/python3.6/site-packages/ptvsd/_vendored/pydevd/pydevd.py", line 1290, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/home/ml/py3-env/lib64/python3.6/site-packages/ptvsd/_vendored/pydevd/_pydev_imps/_pydev_execfile.py", line 25, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "connectors/RemoteConnect.py", line 69, in <module>
    df = dd.read_csv('hdfs://master01.dev.rlc.msk.mts.ru:8020/data/batch/82.csv')
  File "/home/ml/py3-env/lib64/python3.6/site-packages/dask-master-py3.6.egg/dask/dataframe/io/csv.py", line 488, in read
    **kwargs)
  File "/home/ml/py3-env/lib64/python3.6/site-packages/dask-master-py3.6.egg/dask/dataframe/io/csv.py", line 343, in read_pandas
    **(storage_options or {}))
  File "/home/ml/py3-env/lib64/python3.6/site-packages/dask-master-py3.6.egg/dask/bytes/core.py", line 80, in read_bytes
    storage_options=kwargs)
  File "/home/ml/py3-env/lib64/python3.6/site-packages/dask-master-py3.6.egg/dask/bytes/core.py", line 354, in get_fs_token_paths
    fs, fs_token = get_fs(protocol, options)
  File "/home/ml/py3-env/lib64/python3.6/site-packages/dask-master-py3.6.egg/dask/bytes/core.py", line 513, in get_fs
    fs = cls(**storage_options)
  File "/home/ml/py3-env/lib64/python3.6/site-packages/dask-master-py3.6.egg/dask/bytes/pyarrow.py", line 35, in __init__
    self.fs = pa.hdfs.HadoopFileSystem(**update_hdfs_options(kwargs))
  File "/home/ml/py3-env/lib64/python3.6/site-packages/pyarrow/hdfs.py", line 36, in __init__
    _maybe_set_hadoop_classpath()
  File "/home/ml/py3-env/lib64/python3.6/site-packages/pyarrow/hdfs.py", line 136, in _maybe_set_hadoop_classpath
    classpath = _hadoop_classpath_glob('hadoop')
  File "/home/ml/py3-env/lib64/python3.6/site-packages/pyarrow/hdfs.py", line 157, in _hadoop_classpath_glob
    return subprocess.check_output(hadoop_classpath_args)
  File "/usr/lib64/python3.6/subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "/usr/lib64/python3.6/subprocess.py", line 403, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib64/python3.6/subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "/usr/lib64/python3.6/subprocess.py", line 1344, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'hadoop': 'hadoop'
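
Judging by the last frames, pyarrow shells out to the hadoop executable to build the Java classpath, and that executable cannot be resolved from the Python process. A minimal sketch of that lookup (it mirrors the subprocess call in the traceback):

import shutil
import subprocess

# If this prints None, the `hadoop` binary is not on the PATH of the Python
# process, which would explain the FileNotFoundError above.
print(shutil.which("hadoop"))

# Roughly the command pyarrow runs to collect the classpath.
if shutil.which("hadoop"):
    print(subprocess.check_output(["hadoop", "classpath", "--glob"])[:200])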

I suppose it's because of wrong paths (like in this tutorial), since I don't have a HADOOP_HOME variable in printenv at all. But adding this variable manually doesn't fix the problem.
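
To make that attempt concrete, setting the variables from inside Python before the read looks roughly like this. This is only a sketch: HADOOP_HOME is inferred from the hadoop-common jar path in the hadoop version output above, while JAVA_HOME and the libhdfs directory are assumptions and may well be different on this cluster.

import os

os.environ["HADOOP_HOME"] = "/usr/hdp/3.0.1.0-187/hadoop"        # inferred from the jar path above
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java"                     # assumption: actual JDK path may differ
os.environ["ARROW_LIBHDFS_DIR"] = "/usr/hdp/3.0.1.0-187/usr/lib"  # assumption: wherever libhdfs.so lives
# Make the `hadoop` binary resolvable for pyarrow's classpath subprocess call.
os.environ["PATH"] = os.environ["HADOOP_HOME"] + "/bin:" + os.environ["PATH"]

import dask
import dask.dataframe as dd

dask.config.set({"hdfs_driver": "pyarrow"})
df = dd.read_csv("hdfs://master01.myserver.ru:8020/data/batch/82.csv")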

  • Indeed, this is not a malfunction of the driver, but apparently its inability to find your config or hadoop executables. You should probably try to find where the executable `hadoop` points to and the set of environment variables that get used by the spark client. Someone from pyarrow should help you. – mdurant Feb 12 '19 at 15:56
  • @mdurant so maybe the problem is wrong environment variables? If I show them here, can you take a look at them? – Mikhail_Sam Feb 13 '19 at 09:38
  • @mdurant You are right - I checked just pyarrow and it gives the same error. So the problem is in the pyarrow module. – Mikhail_Sam Feb 13 '19 at 10:11
  • Yes, do show the variables, both as seen by python and as seen by the (working) spark. I'm probably not the one to make sense of them, though. – mdurant Feb 13 '19 at 13:35
