
I'm working on an HDP cluster, and I'm trying to read a .csv file from HDFS using pyarrow. I am able to connect to HDFS and print information about the file using the info() function, but when it comes to reading the content of the file, I get a pyarrow.lib.ArrowIOError.

Here is the code I am executing:

# IMPORTS
import pyarrow as pa
from pyarrow import csv
import os
import subprocess

# GET HDFS CLASSPATH
classpath = subprocess.Popen(["/usr/hdp/current/hadoop-client/bin/hdfs", "classpath", "--glob"], stdout=subprocess.PIPE).communicate()[0]

# CONFIGURE ENVIRONMENT VARIABLES
os.environ["HADOOP_HOME"] = "/usr/hdp/current/hadoop-client"
os.environ["JAVA_HOME"] = "/home/G60070/installs/jdk1.8.0_201/"
os.environ["CLASSPATH"] = classpath.decode("utf-8")
os.environ["ARROW_LIBHDFS_DIR"] = "/usr/hdp/2.6.5.0-292/usr/lib/"

# USING PYARROW
## connect to hdfs
fs = pa.hdfs.connect("xxxxxxx.xxx.xxx.fr", 8020)
file = 'hdfs://xxxxxxx.xxx.xxx.fr:8020/user/F43479/trip_data_v2.csv'
print(str(fs.info(file))) # this instruction works well

## read csv file
csv_file = csv.read_csv(file) # this one doesn't work as expected
csv_file

According to the pyarrow documentation, I'm supposed to get a pyarrow Table containing the CSV's columns as a result.

Instead, I'm getting this error: pyarrow.lib.ArrowIOError: Failed to open local file: hdfs://xxxxxxx.xxx.xxx.fr:8020/user/F43479/trip_data_v2.csv, error: file not found

At first, I thought I had miswritten the file path, but I checked HDFS and the file is there:

[F43479@xxxxx dask_tests]$ hdfs dfs -ls /user/F43479/
Found 9 items
-rw-r-----   3 F43479 hdfs            0 2019-03-07 16:42 /user/F43479/-
drwx------   - F43479 hdfs            0 2019-04-03 02:00 /user/F43479/.Trash
drwxr-x---   - F43479 hdfs            0 2019-03-13 16:53 /user/F43479/.hiveJars
drwxr-x---   - F43479 hdfs            0 2019-03-13 16:52 /user/F43479/hive
drwxr-x---   - F43479 hdfs            0 2019-03-15 13:23 /user/F43479/nyctaxi_trip_data
-rw-r-----   3 F43479 hdfs           36 2019-04-15 11:13 /user/F43479/test.csv
-rw-r-----   3 F43479 hdfs  50486731416 2019-03-26 17:37 /user/F43479/trip_data.csv
-rw-r-----   3 F43479 hdfs   5097056230 2019-04-15 13:57 /user/F43479/trip_data_v2.csv
-rw-r-----   3 F43479 hdfs 504867312828 2019-04-02 11:15 /user/F43479/trip_data_x10.csv

What could be the source of the problem?

Thanks for any help.


1 Answer


Try opening the file through the HadoopFileSystem object:

with fs.open(file, 'rb') as f:
    ## read csv file
    csv_file = csv.read_csv(f) 
isalgueiro
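
For reference, here is a minimal self-contained version of that fix, reusing the connection settings from the question (the hostname, port, and path are the question's placeholders, and the environment variables configured in the question are assumed to be set):

import pyarrow as pa
from pyarrow import csv

## connect to hdfs (placeholders from the question)
fs = pa.hdfs.connect("xxxxxxx.xxx.xxx.fr", 8020)
file = 'hdfs://xxxxxxx.xxx.xxx.fr:8020/user/F43479/trip_data_v2.csv'

## opening through the filesystem object routes the read over HDFS;
## read_csv then consumes the file object instead of a path string
with fs.open(file, 'rb') as f:
    csv_file = csv.read_csv(f)

## read_csv returns a pyarrow.Table; its schema lists the column names and types
print(csv_file.schema)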
  • I went to see the [documentation](https://arrow.apache.org/docs/python/filesystems.html#hadoop-file-system-hdfs) about **fs.open()**. What I understood is that this function uses a **C++-based interface** that lets pyarrow connect to the **Hadoop File System**. But by default, pyarrow uses **libhdfs**, a JNI-based interface to the Java Hadoop client, to interact with HDFS. The path to **libhdfs** is defined in the env variable **LD_LIBRARY_PATH** (which I had configured). Did my error occur because pyarrow wasn't able to access HDFS using its default behaviour? – Sevy Apr 16 '19 at 16:13
  • In `csv.read_csv(file)`, the `file` variable is the string `'hdfs://xxxxxxx.xxx.xxx.fr:8020/user/F43479/trip_data_v2.csv'`. While you had created the `fs` filesystem object, `csv.read_csv` is not aware of it. I think the error message could be better – Wes McKinney Apr 23 '19 at 20:08
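
To make the distinction in the last comment concrete, here is a sketch of the two call patterns, assuming the same `fs` and `file` as in the question:

## passing the URI string: read_csv treats it as a local path and raises
## ArrowIOError: Failed to open local file
csv_file = csv.read_csv(file)

## passing a file object opened through fs: the read goes over HDFS
with fs.open(file, 'rb') as f:
    csv_file = csv.read_csv(f)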