I am trying to create a file/directory in HDFS using Python. To be clear, I am running a Hadoop streaming job with a mapper written in Python, and this mapper needs to create a file in HDFS. I have read that there are several Python frameworks for this, but I want to stay with Hadoop streaming. Is there any way to accomplish this in Hadoop streaming?
4 Answers
1
You can run HDFS commands from a Python script:

    import subprocess

    def run_cmd(args_list):
        # Run the command and capture its stdout and stderr.
        proc = subprocess.Popen(args_list, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        (output, errors) = proc.communicate()
        # A non-zero return code means the command failed.
        if proc.returncode:
            raise RuntimeError('Error run_cmd: %s' % errors)
        return (output, errors)

Then call it with the HDFS path of the folder to create:

    (out, errors) = run_cmd(['hdfs', 'dfs', '-mkdir', path_HDFS_to_create_folder])
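Inside a streaming job, the same idea can be called from the mapper itself. The following is a minimal sketch, not a tested production mapper: it assumes the `hdfs` binary is on the PATH of the task nodes, and the names `mkdir_cmd` and `mapper` are hypothetical helpers introduced here for illustration. The `-p` flag is used so that `-mkdir` does not fail when the directory already exists, which matters because several mapper tasks may run the same command.

```python
import subprocess
import sys

def run_cmd(args_list):
    """Run a command and return (returncode, stdout, stderr) as text."""
    proc = subprocess.Popen(args_list, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, universal_newlines=True)
    output, errors = proc.communicate()
    return proc.returncode, output, errors

def mkdir_cmd(hdfs_dir):
    """Build the argument list for `hdfs dfs -mkdir -p <dir>`."""
    return ['hdfs', 'dfs', '-mkdir', '-p', hdfs_dir]

def mapper(lines, hdfs_dir, run=run_cmd, log=sys.stderr):
    """Create hdfs_dir once, then emit tab-separated (line, 1) records."""
    ret, _, err = run(mkdir_cmd(hdfs_dir))
    if ret != 0:
        # Streaming treats stdout as mapper output, so diagnostics go to stderr.
        log.write('mkdir failed: %s\n' % err)
    for line in lines:
        yield '%s\t1' % line.rstrip('\n')
```

In the mapper script, the driver would be something like `for record in mapper(sys.stdin, '/tmp/streaming_out'): print(record)`, where `/tmp/streaming_out` is a placeholder path.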

Diego Mosquera Prada
0
There is no way to create a file with a Python script, but it is possible to create a directory using pydoop or snakebite.
See: https://www.geeksforgeeks.org/creating-files-in-hdfs-using-python-snakebite/

Zak_Stack
- No way to create file? https://hdfscli.readthedocs.io/en/latest/quickstart.html#reading-and-writing-files – OneCricketeer Sep 22 '22 at 20:21
- yes it is possible to create file using: `(ret, out, err) = run_cmd(['hdfs', 'dfs', '-touchz', filename])` – Zak_Stack Sep 30 '22 at 13:34
- Yes, but no. It's possible with `pip install hdfs` **not subprocess** - https://pypi.org/project/hdfs/ – OneCricketeer Sep 30 '22 at 16:31
- it's not about that – Zak_Stack Oct 03 '22 at 09:44
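For reference, the `hdfs` package (HdfsCLI) mentioned in the comments above talks to WebHDFS and can create both directories and files without spawning a subprocess. The following is a rough sketch, assuming the package is installed with `pip install hdfs` and WebHDFS is enabled on the cluster. The helper names `make_client`, `create_dir` and `create_file` are hypothetical wrappers introduced here; the underlying calls are `InsecureClient`, `makedirs` and `write` from that library.

```python
def make_client(url, user):
    """Create a WebHDFS client; requires `pip install hdfs`."""
    # Imported lazily so the helpers below can be loaded without the package.
    from hdfs import InsecureClient
    return InsecureClient(url, user=user)

def create_dir(client, hdfs_dir):
    """Create a directory (and any missing parents) in HDFS."""
    client.makedirs(hdfs_dir)

def create_file(client, hdfs_path, text):
    """Write text to an HDFS file, replacing it if it already exists."""
    client.write(hdfs_path, data=text, encoding='utf-8', overwrite=True)

# Example usage; the URL, user and paths are placeholders, not real endpoints:
# client = make_client('http://namenode.example:9870', 'hadoop')
# create_dir(client, '/tmp/streaming_out')
# create_file(client, '/tmp/streaming_out/part.txt', 'hello\n')
```

The NameNode HTTP address and port depend on the cluster configuration, so the URL in the usage comment is an assumption.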
0
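Solution using subprocess,
inspired by this answer in the "Create HDFS file" question.

    from subprocess import Popen, PIPE
    # run_cmd must be defined first: a helper that wraps Popen and returns (return code, stdout, stderr)
    (ret, out, err) = run_cmd(['hdfs', 'dfs', '-touchz', '/directory/filename'])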
0
It is possible to create a file using:

    from subprocess import Popen, PIPE

    # Define a helper that runs a native Hadoop/Linux command.
    def run_cmd(args_list):
        """
        Run a Linux command and return (return code, stdout, stderr).
        """
        print('Running system command: {0}'.format(' '.join(args_list)))
        proc = Popen(args_list, stdout=PIPE, stderr=PIPE)
        s_output, s_err = proc.communicate()
        s_return = proc.returncode
        return s_return, s_output, s_err

    # filename is the HDFS path of the empty file to create.
    (ret, out, err) = run_cmd(['hdfs', 'dfs', '-touchz', filename])
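Note that `-touchz` only creates a zero-length file. If the mapper needs to write content, one option is `hdfs dfs -put - <path>`, which reads the data from stdin. The following is a minimal sketch under that assumption; `put_cmd` and `write_hdfs_file` are hypothetical names introduced here, and the `run` parameter exists only so the function can be exercised without a cluster.

```python
import subprocess

def put_cmd(hdfs_path):
    """Build the argument list for `hdfs dfs -put - <path>` (data comes from stdin)."""
    return ['hdfs', 'dfs', '-put', '-', hdfs_path]

def write_hdfs_file(hdfs_path, data, run=subprocess.run):
    """Stream bytes to hdfs_path; raise RuntimeError if the command fails."""
    result = run(put_cmd(hdfs_path), input=data,
                 stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if result.returncode != 0:
        raise RuntimeError('hdfs put failed: %r' % result.stderr)
    return result.returncode

# Example usage; the path is a placeholder:
# write_hdfs_file('/directory/filename', b'some content\n')
```

Without `-f`, `-put` fails when the destination already exists, so tasks that may be retried should either use unique file names or handle that error.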

Zak_Stack
- Please edit your [other answer](https://stackoverflow.com/a/73828803/2308683)(s) rather than post multiple different ones – OneCricketeer Sep 30 '22 at 16:30