I've been using the mrjob package with Python 3.7 recently. I started Hadoop and created a wordaccount.py file that calculates the frequency of each word in a .txt file. When I tried to run it with python3 wordaccount.py -r hadoop data/hamlet.txt > 1.txt, I ran into the following problem:
xjj@master:/usr/local/hadoop/pyhadoop$ python3 wordaccount.py -r hadoop data/hamlet.txt>1.txt
No configs found; falling back on auto-configuration
No configs specified for hadoop runner
Looking for hadoop binary in /usr/local/hadoop/bin...
Found hadoop binary: /usr/local/hadoop/bin/hadoop
Using Hadoop version 2.7.1
Looking for Hadoop streaming jar in /usr/local/hadoop...
Found Hadoop streaming jar: /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar
Creating temp directory /tmp/wordaccount.xjj.20220522.085604.327723
uploading working dir files to hdfs:///user/xjj/tmp/mrjob/wordaccount.xjj.20220522.085604.327723/files/wd...
STDERR: 22/05/22 16:56:07 WARN hdfs.DFSClient: DataStreamer Exception
STDERR: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/xjj/tmp/mrjob/wordaccount.xjj.20220522.085604.327723/files/wd/mrjob.zip._COPYING_ could only be replicated to 0 nodes instead of minReplication (=1). There are 0 datanode(s) running and no node(s) are excluded in this operation.
STDERR: at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1550)
STDERR: at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3110)
STDERR: at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3034)
STDERR: at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:723)
STDERR: at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:492)
STDERR: at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
STDERR: at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
STDERR: at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
STDERR: at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
STDERR: at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
STDERR: at java.security.AccessController.doPrivileged(Native Method)
STDERR: at javax.security.auth.Subject.doAs(Subject.java:422)
STDERR: at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
STDERR: at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
STDERR:
STDERR: at org.apache.hadoop.ipc.Client.call(Client.java:1476)
STDERR: at org.apache.hadoop.ipc.Client.call(Client.java:1407)
STDERR: at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
STDERR: at com.sun.proxy.$Proxy9.addBlock(Unknown Source)
STDERR: at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:418)
STDERR: at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
STDERR: at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
STDERR: at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
STDERR: at java.lang.reflect.Method.invoke(Method.java:498)
STDERR: at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
STDERR: at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
STDERR: at com.sun.proxy.$Proxy10.addBlock(Unknown Source)
STDERR: at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1430)
STDERR: at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1226)
STDERR: at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:449)
STDERR: put: File /user/xjj/tmp/mrjob/wordaccount.xjj.20220522.085604.327723/files/wd/mrjob.zip._COPYING_ could only be replicated to 0 nodes instead of minReplication (=1). There are 0 datanode(s) running and no node(s) are excluded in this operation.
Traceback (most recent call last):
File "wordaccount.py", line 19, in <module>
WordCount.run()
File "/home/xjj/anaconda3/lib/python3.7/site-packages/mrjob/job.py", line 616, in run
cls().execute()
File "/home/xjj/anaconda3/lib/python3.7/site-packages/mrjob/job.py", line 687, in execute
self.run_job()
File "/home/xjj/anaconda3/lib/python3.7/site-packages/mrjob/job.py", line 636, in run_job
runner.run()
File "/home/xjj/anaconda3/lib/python3.7/site-packages/mrjob/runner.py", line 503, in run
self._run()
File "/home/xjj/anaconda3/lib/python3.7/site-packages/mrjob/hadoop.py", line 328, in _run
self._upload_local_files()
File "/home/xjj/anaconda3/lib/python3.7/site-packages/mrjob/runner.py", line 1156, in _upload_local_files
self._copy_files_to_wd_mirror()
File "/home/xjj/anaconda3/lib/python3.7/site-packages/mrjob/runner.py", line 1257, in _copy_files_to_wd_mirror
self._copy_file_to_wd_mirror(path, name)
File "/home/xjj/anaconda3/lib/python3.7/site-packages/mrjob/runner.py", line 1238, in _copy_file_to_wd_mirror
self.fs.put(path, dest)
File "/home/xjj/anaconda3/lib/python3.7/site-packages/mrjob/fs/composite.py", line 151, in put
return self._handle('put', path, src, path)
File "/home/xjj/anaconda3/lib/python3.7/site-packages/mrjob/fs/composite.py", line 110, in _handle
return getattr(fs, name)(*args, **kwargs)
File "/home/xjj/anaconda3/lib/python3.7/site-packages/mrjob/fs/hadoop.py", line 321, in put
self.invoke_hadoop(['fs', '-put', src, path])
File "/home/xjj/anaconda3/lib/python3.7/site-packages/mrjob/fs/hadoop.py", line 183, in invoke_hadoop
raise CalledProcessError(proc.returncode, args)
subprocess.CalledProcessError: Command '['/usr/local/hadoop/bin/hadoop', 'fs', '-put', '/tmp/wordaccount.xjj.20220522.085604.327723/mrjob.zip', 'hdfs:///user/xjj/tmp/mrjob/wordaccount.xjj.20220522.085604.327723/files/wd/mrjob.zip']' returned non-zero exit status 1.
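Reading through that output, the actual failure seems to be the RemoteException near the top: the NameNode reports "There are 0 datanode(s) running", so the hadoop fs -put of mrjob.zip that mrjob issues has no DataNode to write the block to, and everything after that is fallout from the failed upload. A minimal sketch for checking the live-DataNode count, assuming the hdfs script sits next to the hadoop binary mrjob found:

import subprocess

# 'hdfs dfsadmin -report' asks the NameNode for cluster status, including
# the number of live DataNodes. The path is assumed from mrjob's log above.
report = subprocess.run(
    ['/usr/local/hadoop/bin/hdfs', 'dfsadmin', '-report'],
    capture_output=True, text=True, check=True,
)
print(report.stdout)  # a healthy single-node setup reports "Live datanodes (1)"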
The content of wordaccount.py is as follows; all it does is count how many times each word occurs:
from mrjob.job import MRJob

class WordCount(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) for every whitespace-separated token in the line.
        for word in line.split():
            yield word, 1

    def reducer(self, key, values):
        # Sum all the 1s emitted for this word.
        yield key, sum(values)

if __name__ == '__main__':
    WordCount.run()
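For what it's worth, the job logic can be sanity-checked without Hadoop at all, because mrjob falls back to its inline runner when -r hadoop is omitted. A minimal sketch, assuming the file layout from the command above:

from wordaccount import WordCount

# Run the job in-process with mrjob's default inline runner (no Hadoop
# involved) to confirm the mapper/reducer logic itself is fine.
job = WordCount(args=['data/hamlet.txt'])
with job.make_runner() as runner:
    runner.run()
    for word, count in job.parse_output(runner.cat_output()):
        print(word, count)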
I'm sure I started Hadoop via sbin/start-all.sh, and both wordaccount.py and hamlet.txt really exist. Following the traceback, I also located mrjob's hadoop.py.
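One note: start-all.sh finishing does not by itself prove the DataNode stayed up. A quick way to list which Hadoop daemons are actually alive is jps (it ships with the JDK); a small sketch wrapping it in Python:

import subprocess

# jps lists running JVM processes; after start-all.sh a healthy
# pseudo-distributed node should show NameNode, DataNode,
# SecondaryNameNode, ResourceManager and NodeManager.
print(subprocess.run(['jps'], capture_output=True, text=True).stdout)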
So what changes should I make? Thanks.