
I'm using Dask Distributed and I'm trying to create a dataframe from a CSV stored in HDFS. I suppose the connection to HDFS is successful, as I'm able to print the dataframe's column names. However, I get the following error when I try to use len or any other function on the dataframe:

pyarrow.lib.ArrowIOError: HDFS file does not exist: /user/F43479/trip_data_v2.csv

I don't understand why I get this error and would appreciate your input.

Here is my code:

# IMPORTS
import dask.dataframe as dd
from dask.distributed import Client
import pyarrow as pa
from pyarrow import csv
from dask import compute, config
import os
import subprocess

# GET HDFS CLASSPATH
classpath = subprocess.Popen(["/usr/hdp/current/hadoop-client/bin/hdfs", "classpath", "--glob"], stdout=subprocess.PIPE).communicate()[0]

# CONFIGURE ENVIRONMENT VARIABLES
os.environ["HADOOP_HOME"] = "/usr/hdp/current/hadoop-client"
os.environ["JAVA_HOME"] = "/home/G60070/installs/jdk1.8.0_201/"
os.environ["CLASSPATH"] = classpath.decode("utf-8")
os.environ["ARROW_LIBHDFS_DIR"] = "/usr/hdp/2.6.5.0-292/usr/lib/"

# LAUNCH DASK DISTRIBUTED
client = Client('10.22.104.37:8786')

# SET HDFS CONNECTION
config.set(hdfs_driver='pyarrow', host='xxxxx.xxx.xx.fr', port=8020)

# READ FILE ON HDFS
folder = 'hdfs://xxxxx.xxx.xx.fr:8020/user/F43479/'
filepath = folder + 'trip_data_v2.csv'
df = dd.read_csv(filepath)

# TREATMENTS ON FILE
print(df.columns)  # this works
print(len(df))  # produces an error

Here is the content of my HDFS repository:

[F43479@xxxxx dask_tests]$ hdfs dfs -ls /user/F43479/
Found 9 items
-rw-r-----   3 F43479 hdfs            0 2019-03-07 16:42 /user/F43479/-
drwx------   - F43479 hdfs            0 2019-04-03 02:00 /user/F43479/.Trash
drwxr-x---   - F43479 hdfs            0 2019-03-13 16:53 /user/F43479/.hiveJars
drwxr-x---   - F43479 hdfs            0 2019-03-13 16:52 /user/F43479/hive
drwxr-x---   - F43479 hdfs            0 2019-03-15 13:23 /user/F43479/nyctaxi_trip_data
-rw-r-----   3 F43479 hdfs           36 2019-04-15 11:13 /user/F43479/test.csv
-rw-r-----   3 F43479 hdfs  50486731416 2019-03-26 17:37 /user/F43479/trip_data.csv
-rw-r-----   3 F43479 hdfs   5097056230 2019-04-15 13:57 /user/F43479/trip_data_v2.csv
-rw-r-----   3 F43479 hdfs 504867312828 2019-04-02 11:15 /user/F43479/trip_data_x10.csv

And finally, the full result of the code execution:

Index(['vendor_id', 'passenger_count', 'trip_time_in_secs', 'trip_distance'], dtype='object')
Traceback (most recent call last):
  File "dask_pa_hdfs.py", line 32, in <module>
    print(len(df))
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/dask/dataframe/core.py", line 438, in __len__
    split_every=False).compute()
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/dask/base.py", line 156, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/dask/base.py", line 397, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/distributed/client.py", line 2321, in get
    direct=direct)
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/distributed/client.py", line 1655, in gather
    asynchronous=asynchronous)
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/distributed/client.py", line 673, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/distributed/utils.py", line 277, in sync
    six.reraise(*error[0])
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/six.py", line 693, in reraise
    raise value
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/distributed/utils.py", line 262, in f
    result[0] = yield future
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/distributed/client.py", line 1500, in _gather
    traceback)
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/dask/bytes/core.py", line 133, in read_block_from_file
    with copy.copy(lazy_file) as f:
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/dask/bytes/core.py", line 177, in __enter__
    f = SeekableFile(self.fs.open(self.path, mode=mode))
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/dask/bytes/pyarrow.py", line 37, in open
    return self.fs.open(path, mode=mode, **kwargs)
  File "pyarrow/io-hdfs.pxi", line 431, in pyarrow.lib.HadoopFileSystem.open
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS file does not exist: /user/F43479/trip_data_v2.csv
Sevy
  • Hi, my cluster is not Kerberized and I'm facing the same issue; the solution below is not working. – Nirmal Ram Nov 27 '19 at 14:02
  • @NirmalRam what's your configuration? Do your workers have permission to access the file? Do you have a worker started on each HDFS node? – Sevy Nov 28 '19 at 16:24

2 Answers


You have carefully set up the environment in your local process, the one containing the client, so that it can communicate with HDFS. Finding the columns only needs this, because Dask does it up front in the client process, reading just the first few rows of the data. However:

client = Client('10.22.104.37:8786')

your scheduler and workers live elsewhere and do not have the environment variables you set locally. When your tasks run, the workers do not know how to find the file.

What you need to do is set the environment on the workers too. This could be done before they are launched, or once they are already up:

def setenv():
    import os
    import subprocess
    # Compute the classpath on the worker itself, since the Hadoop
    # installation may live at different paths than on the client
    classpath = subprocess.Popen(
        ["/usr/hdp/current/hadoop-client/bin/hdfs", "classpath", "--glob"],
        stdout=subprocess.PIPE).communicate()[0]
    os.environ["HADOOP_HOME"] = "/usr/hdp/current/hadoop-client"
    os.environ["JAVA_HOME"] = "/home/G60070/installs/jdk1.8.0_201/"
    os.environ["CLASSPATH"] = classpath.decode("utf-8")
    os.environ["ARROW_LIBHDFS_DIR"] = "/usr/hdp/2.6.5.0-292/usr/lib/"

client.run(setenv)

(this should return a dict mapping each worker to None)

Note that, if new workers come online dynamically, they would each need to run this function before accessing HDFS.
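One way to automate that (a sketch, assuming distributed's register_worker_callbacks, which runs a setup function on every current and future worker):

# Run setenv on every current worker, and on any worker that joins later
client.register_worker_callbacks(setenv)

# Optional sanity check: each worker should now report the configured path
def check_env():
    import os
    return os.environ.get("ARROW_LIBHDFS_DIR")

print(client.run(check_env))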

mdurant
  • Thank you for your answer. That solution doesn't seem to work; I get the same error. I don't understand why the same env variables need to be specified on the workers. The paths contained in those variables reference files located on the scheduler. These files are not located on the workers, so I think that's why defining those variables on the workers doesn't have any effect. Does that make sense? – Sevy May 02 '19 at 12:26
  • Right, the workers need to have access to the java/hadoop files too, if you want them to be able to access HDFS. If they are in *different* places on the workers, modify the variables accordingly. – mdurant May 02 '19 at 12:49

I solved the problem. It was related to permissions to access HDFS. I'm working on a Kerberised HDFS cluster; I started the Dask scheduler process on the edge node and the worker processes on the data nodes.
To access HDFS, pyarrow needs two things:

  • It has to be installed on the scheduler and on all the workers (a quick check is sketched just after this list)
  • Environment variables need to be configured on all the nodes as well
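
A quick way to confirm the first point (a minimal sketch: it just imports pyarrow on every worker and reports the version; a missing install surfaces as an ImportError from that worker):

def check_pyarrow():
    import pyarrow
    return pyarrow.__version__

# Returns a dict mapping each worker's address to its pyarrow version
print(client.run(check_pyarrow))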

Then, to access HDFS, the started processes need to be authenticated through Kerberos. When launching the code from the scheduler process, I am able to connect to HDFS because my session is authenticated through Kerberos; that's why I am able to get information about the CSV file's columns.
However, since the worker processes were not authenticated, they couldn't access HDFS, which caused the error. To solve it, we had to stop the worker processes, modify the script used to start them so that it includes a Kerberos command to authenticate to HDFS (kinit something), then restart the worker processes.
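For illustration, the same authentication could also be pushed from the client instead of the start script (a hypothetical sketch: the keytab path and principal are placeholders, and kinit must be available on every worker node):

def kerberos_auth():
    import subprocess
    # Placeholder keytab and principal: authenticate this worker's
    # Kerberos session before it touches HDFS
    subprocess.run(
        ["kinit", "-kt", "/path/to/user.keytab", "user@EXAMPLE.REALM"],
        check=True)

client.run(kerberos_auth)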
It works for now, but it also means that Dask isn't fully compatible with a Kerberised cluster: with the configuration we made, all users end up with the same permissions on HDFS when launching a computation from the workers. I don't think that is a totally safe practice.

Sevy