
My OS is Windows 10 64-bit and I use Anaconda with Python 3.8 64-bit. I am trying to develop a Hadoop File System 3.3 client with the PyArrow module. Installing PyArrow with conda on Windows 10 succeeds:

> conda install -c conda-forge pyarrow

But connecting to HDFS 3.3 with PyArrow throws errors:

import pyarrow as pa
fs = pa.hdfs.connect(host='localhost', port=9000)

The errors are

Traceback (most recent call last):
  File "C:\eclipse-workspace\PythonFredProj\com\aaa\fred\hdfs3-test.py", line 14, in <module>
    fs = pa.hdfs.connect(host='localhost', port=9000)
  File "C:\Python-3.8.3-x64\lib\site-packages\pyarrow\hdfs.py", line 208, in connect
    fs = HadoopFileSystem(host=host, port=port, user=user,
  File "C:\Python-3.8.3-x64\lib\site-packages\pyarrow\hdfs.py", line 38, in __init__
    _maybe_set_hadoop_classpath()
  File "C:\Python-3.8.3-x64\lib\site-packages\pyarrow\hdfs.py", line 136, in _maybe_set_hadoop_classpath
    classpath = _hadoop_classpath_glob(hadoop_bin)
  File "C:\Python-3.8.3-x64\lib\site-packages\pyarrow\hdfs.py", line 163, in _hadoop_classpath_glob
    return subprocess.check_output(hadoop_classpath_args)
  File "C:\Python-3.8.3-x64\lib\subprocess.py", line 411, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "C:\Python-3.8.3-x64\lib\subprocess.py", line 489, in run
    with Popen(*popenargs, **kwargs) as process:
  File "C:\Python-3.8.3-x64\lib\subprocess.py", line 854, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Python-3.8.3-x64\lib\subprocess.py", line 1307, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
OSError: [WinError 193] %1 is not a valid win32 application

I installed Visual C++ 2015 on Windows 10, but the same errors are still shown.

  • I solved it. I had installed and uninstalled Anaconda many times on my virtual machine, and I think that caused these errors. I removed Windows 10 from the virtual machine entirely and created the virtual machine again, and PyArrow now works with no errors. Thanks anyway. – Joseph Hwang Oct 21 '20 at 10:08
  • Would it be worth writing an answer for the benefit of future readers, Joseph? Or did you merely reinstall and can't explain why that fixed it? I wonder if it is worth closing this question. – halfer Mar 02 '21 at 08:46

1 Answer


This is my solution.

  1. Before starting with pyarrow, Hadoop 3 has to be installed on your Windows 10 64-bit machine, and the installation path has to be added to Path.

  2. Install pyarrow 3.0 (the version is important; it has to be 3.0):

    pip install pyarrow==3.0

  3. Create a PyDev module in the Eclipse PyDev perspective. The sample code is like below:

     from pyarrow import fs

     hadoop = fs.HadoopFileSystem("localhost", port=9000)
     print(hadoop.get_file_info('/'))

  4. Choose your created PyDev module and open [Properties] (Alt + Enter).

  5. Click [Run/Debug Settings]. Choose the PyDev module and click the [Edit] button.

  6. In the [Edit Configuration] window, select the [Environment] tab.

  7. Click the [Add] button.

  8. You have to create two environment variables: CLASSPATH and LD_LIBRARY_PATH.

     • CLASSPATH: In a command prompt, execute the following command:

       hdfs classpath --glob

       Copy the returned value and paste it into the Value text field (the returned value is one long string, but copy all of it).

     • LD_LIBRARY_PATH: Insert the path of the libhdfs.so file in Hadoop 3 (in my case, C:\hadoop-3.3.0\lib\native) into the Value text field.

  9. OK! The pyarrow 3.0 configuration is set. You can connect to Hadoop 3 on Windows 10 from Eclipse PyDev.
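If you prefer not to configure the Eclipse Run/Debug Settings, the same two variables can be set from Python itself before pyarrow loads libhdfs. This is a sketch: the helper name is mine, the example paths assume a Hadoop install at C:\hadoop-3.3.0, and the classpath string is shortened (use the full output of `hdfs classpath --glob`).

```python
import os

def configure_hdfs_env(classpath, native_lib_dir):
    # Mirror the two Run/Debug Settings variables from the steps above.
    # `classpath` should be the output of `hdfs classpath --glob`;
    # `native_lib_dir` is the folder that contains the libhdfs library.
    os.environ["CLASSPATH"] = classpath
    os.environ["LD_LIBRARY_PATH"] = native_lib_dir

# Example values (assumptions; shortened for the sketch):
configure_hdfs_env(
    classpath=r"C:\hadoop-3.3.0\etc\hadoop;C:\hadoop-3.3.0\share\hadoop\common\*",
    native_lib_dir=r"C:\hadoop-3.3.0\lib\native",
)

# With the environment set, the connection from step 3 can be attempted
# (requires a running Hadoop 3 NameNode on localhost:9000):
#   from pyarrow import fs
#   hadoop = fs.HadoopFileSystem("localhost", port=9000)
#   print(hadoop.get_file_info("/"))
```

The key point is that both variables must be in the process environment before `fs.HadoopFileSystem` is first called, since pyarrow reads them when it loads the native library.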