
We are trying out dask_yarn version 0.3.0 (with dask 0.18.2) because of conflicts between boost-cpp and pyarrow version 0.10.0, which we are running.
We are trying to read a CSV file from HDFS, but dd.read_csv('hdfs:///path/to/file.csv') fails because dask tries to use hdfs3.
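A minimal reproducer (assuming only the standard dask.dataframe import alias):

import dask.dataframe as dd

df = dd.read_csv('hdfs:///path/to/file.csv')  # raises the ImportError below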

ImportError: Can not find the shared library: libhdfs3.so

From the documentation it seems that there is an option to use pyarrow instead.

What is the correct syntax/configuration to do so?

skibee

1 Answer


Try finding the file using locate -l 1 libhdfs.so. In my case, the file is located under /opt/mapr/hadoop/hadoop-0.20.2/c++/Linux-amd64-64/lib.

Then, restart your Jupyter server with the environment variable ARROW_LIBHDFS_DIR set to this path. In my case, the command looks like:

ARROW_LIBHDFS_DIR=/opt/mapr/hadoop/hadoop-0.20.2/c++/Linux-amd64-64/lib jupyter lab --port 2250 --no-browser
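
Alternatively, you can set the variable from within the notebook itself, as long as this happens before pyarrow first connects to HDFS. A sketch, assuming the same library path as above:

import os

# Assumption: same libhdfs.so location as found above; adjust for your setup.
os.environ['ARROW_LIBHDFS_DIR'] = '/opt/mapr/hadoop/hadoop-0.20.2/c++/Linux-amd64-64/lib'

import pyarrow as pa

fs = pa.hdfs.connect()  # fails here if libhdfs.so still cannot be located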

Next, when you create the Yarn Cluster, pass this variable as a worker parameter:

from dask_yarn import YarnCluster

# Create a cluster where each worker has two cores and eight GiB of memory
cluster = YarnCluster(
    worker_vcores=2,
    worker_memory='8GiB',
    worker_env={
        # See https://github.com/dask/dask-yarn/pull/30#issuecomment-434001858
        'ARROW_LIBHDFS_DIR': '/opt/mapr/hadoop/hadoop-0.20.2/c++/Linux-amd64-64/lib',
    },
)
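
With the cluster running, connect a client and retry the read; the workers now inherit ARROW_LIBHDFS_DIR. A sketch; the hdfs_driver config key is my recollection of how dask selected its HDFS driver around this version, so treat it as an assumption:

import dask
import dask.dataframe as dd
from dask.distributed import Client

# Assumption: this config key tells dask to use pyarrow's HDFS driver
# rather than hdfs3; it may not exist on every dask version.
dask.config.set(hdfs_driver='pyarrow')

client = Client(cluster)
df = dd.read_csv('hdfs:///path/to/file.csv')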

This solved the problem for me.

(Inspired by https://gist.github.com/priancho/357022fbe63fae8b097a563e43dd885b)

tslmy